PHONOLOGICAL AWARENESS: THE ROLE OF READING EXPERIENCE* 



Virginia A. Mann† 



Abstract. A cross-cultural study of Japanese and American children 
has examined the development of awareness about syllables and 
phonemes. Using counting tests and deletion tests, Experiments I 
and III reveal that in contrast to first graders in America, most of 
whom tend to be aware of both syllables and phonemes, almost all 
first graders in Japan are aware of mora (phonological units roughly 
equivalent to syllables), but relatively few are aware of phonemes. 
This difference in phonological awareness may be attributed to the 
fact that Japanese first graders learn to read a syllabary, whereas 
American first graders learn to read an alphabet. For most children 
at this age, awareness of phonemes may require experience with 
alphabetic transcription, whereas awareness of syllables may be 
facilitated by experience with a syllabary, but be less dependent 
upon it. To clarify further the role of knowledge of an alphabet on 
children's awareness of phonemes, Experiments II and IV administered 
the same counting and deletion tests to Japanese children in the 
later elementary grades. Here the data reveal that many Japanese 
children become aware of phonemes by age ten whether or not they 
have received instruction in alphabetic transcription. Discussion 
of these results focuses on some of the other factors that may 
promote phonological awareness. 

Introduction 

The primary language activities of listening and speaking do not require 
an explicit awareness of the internal phonological structure of words any more 
than they require an explicit awareness of the rules of syntax. Yet a 
"metalinguistic" awareness that words comprise syllables and phonemes is 
precisely what is needed when language users turn from the primary language 
activities of speaking and listening to the secondary language activities of 
reading, versification, and word games (Liberman, 1971; Mattingly, 1972, 
1984). While all members of a given community become speakers and hearers, 
not all become readers, nor do they all play word games or appreciate verse. 
This difference raises the possibility that the development of phonological 
awareness might require some special cultivating experience above and beyond 
that which supports primary language acquisition. 

*Cognition, in press. 

Several different research groups have reported that adults who cannot 
read an alphabetic orthography are unable to manipulate phonemes (Byrne & 
Ledez, 1986; Liberman, Rubin, Duques, & Carlisle, 1985; Morais, Cary, Alegria, 
& Bertelson, 1979; Read, Zhang, Nie, & Ding, 1984), raising the possibility 
that knowledge of the alphabet is essential to awareness of phonemes. In 
further pursuit of the factors that give rise to phonological awareness, the 
present study has explored the awareness of syllables and phonemes among 
Japanese children and American children. This particular cross-linguistic 
comparison is prompted by certain differences between the English and Japanese 
orthographies, and by certain differences in the word games and versification 
devices that are available to children in the two language communities. 

Children in America learn to read the English orthography, an alphabet 
that represents spoken language at the level of the phoneme. Many of them 
also play phoneme-based word games such as "pig-Latin" and "Geography," and 
learn to employ versification devices such as alliteration that involve 
manipulations of phonemes, as well as word games and versification devices 
that exploit meter and thus operate on syllable-sized units. In contrast, 
virtually all of the secondary language activities that are available to 
Japanese children manipulate mora — phonological units that are roughly 
equivalent to syllables — if they manipulate phonological structure at all. 
Japanese children learn to read an orthography that comprises two types of 
transcription: Kanji, a morphology-based system, and Kana, a phonology-based 
system. Kanji is derived from the Chinese logography and represents the roots 
of words without regard to grammatical inflections, whereas Kana is of native 
origin and comprises two syllabaries, Hiragana and Katakana, which can 
represent the root and inflection of any word in terms of their constituent 
mora. Typically, the two orthographies function together, with Kanji 
representing most word roots and Kana representing all word inflections and 
the roots of those words that lack Kanji characters. As for other secondary 
language activities, Japanese word games such as "Shiritori" (a mora-based 
equivalent of "Geography") and versification devices such as Haiku manipulate 
mora. 

In short, Japanese secondary language activities do not manipulate 
language at the level of the phoneme, whereas several English secondary 
language activities are phoneme-based, most notably the alphabetic 
orthography. Both Japanese and English afford versification devices and word 
games that manipulate syllable-sized units, but the Japanese orthography is 
unique in its inclusion of a syllabary. Given these similarities and 
differences between the orthographies and other secondary language activities 
in English and Japanese, it may be reasoned that, if experience with secondary 
language activities plays a specific role in the development of awareness 
about syllables and phonemes, Japanese children should be aware of mora 
(syllables), whereas American children should be aware of both phonemes and 
syllables. Should the experience of learning to read a given type of 
orthography play a particularly critical role, Japanese children should be 
more aware of syllables than their American counterparts, who should be more 
aware of phonemes. It seems unlikely that the possession of primary language 
skills is sufficient to make Japanese and American children equivalent in 
awareness of phonemes, given findings that alphabet-illiterate adults are not 
aware of phonemes. However, it remains possible that children in the two 
countries will be equivalent in phonological awareness should reading 
experience or some other form of secondary language experience that draws the 
child's attention to the phonological structure of language promote the 
awareness of both syllables and phonemes. 

The possibility that reading experience plays a particularly important 
role in the development of phonological awareness arises from the many studies 
that reveal an association between phonological awareness and success in 
learning to read an alphabetic orthography. These reveal that performance on 
tasks that require manipulations of phonological structure not only 
distinguishes good and poor readers in the early elementary grades (see, for 
example, Alegria, Pignot, & Morais, 1982; Fox & Routh, 1976; Katz, 1982; 
Liberman, 1973; Rosner & Simon, 1973) but also correlates with children's 
scores on standard reading tests (see, for example, Calfee, Lindamood, & 
Lindamood, 1973; Fox & Routh, 1976; Perfetti, 1985; Stanovich, Cunningham, & 
Freeman, 1984b; Treiman & Baron, 1983). 

In many studies of reading ability and phonological awareness, the 
question of cause and effect has been broached, but never completely resolved. 
One of the earliest studies revealed that American children's awareness of 
phonological structure markedly improves at just that age when they are 
beginning to read (Liberman, Shankweiler, Fischer, & Carter, 1974): Among a 
sample of four-, five-, and six-year-olds, none of the youngest children could 
identify the number of phonemes in a spoken word, while half could identify 
the number of syllables; of the five-year-olds, 17 percent could count 
phonemes while, again, half could count syllables. Most dramatically, 70 
percent of the six-year-olds could count phonemes and 90 percent could count 
syllables. Did the older children become aware of syllables and phonemes 
because they were learning to read, was the opposite true, or both? 

Certain evidence suggests that phonological awareness can precede reading 
ability or develop independently. First of all, various measures of phoneme 
awareness and syllable awareness are capable of presaging the success with 
which preliterate kindergarten children will learn to read the alphabet in the 
first grade (see, for example, Bradley & Bryant, 1983; Helfgott, 1976; 
Jusczyk, 1977; Liberman et al., 1974; Lundberg, Oloffson, & Wall, 1980; Mann, 
1984; Mann & Liberman, 1984; Stanovich, Cunningham, & Cramer, 1984a). Second, 
there is evidence that explicit training in the ability to manipulate phonemes 
can facilitate preliterate children's ability to learn to read (Bradley & 
Bryant, 1985). Third, the awareness of syllables, in particular, does not 
appear to depend upon reading experience, as the majority of preliterate 
children can manipulate syllables by age six without having been instructed in 
the use of a syllabary or an alphabet (Amano, 1970; Liberman et al., 1974; 
Mann & Liberman, 1984), and the ability to manipulate syllables is not 
strongly influenced by the kind of reading instruction, "whole-word" or 
"phonics," that children receive in the first grade (Alegria et al., 1982). 

Other evidence, however, has revealed that at least one component of 
phonological awareness (awareness of phonemes) may depend on knowledge of an 
alphabet. As noted previously, several different investigators have reported 
that the ability to manipulate phonemes is markedly deficient in adults who 
cannot read alphabetic transcription. Awareness of phonemes is deficient 
among semi-literate American adults (Liberman et al., 1985), reading-disabled 
Australian adults (Byrne & Ledez, 1986), illiterate Portuguese adults (Morais 
et al., 1979), and Chinese adults who can read only the Chinese logographic 
orthography (Read et al., 1984). In addition, the type of reading instruction 
that children receive can influence the extent of their awareness: 
first-graders who have been taught to read the alphabet by a "phonics" 
approach tend to be more aware of phonemes than those who have learned by a 
"whole-word" method (Alegria et al., 1982). 

Present evidence, then, suggests that the relationship between 
phonological awareness and reading ability is a two-way street (Perfetti, 
1985), which may depend on the level of awareness being addressed. Awareness 
of syllables is not very dependent on reading experience and could be a 
natural cognitive achievement of sorts, whereas awareness of phonemes may 
depend upon the experience of learning to read the alphabet, in general, and 
on methods of instruction that draw attention to phonemic structure, in 
particular. As a test of this view, the present study examined the phoneme 
and syllable awareness of children in a Japanese elementary school, predicting 
that these children would be aware of syllables, but would not be aware of 
phonemes until that point in their education when they receive instruction in 
the use of alphabetic transcription. 

The design of the study involves four experiments that focus on the 
awareness of syllables (mora) and phonemes among children at different ages. 
Two different experimental paradigms are employed as a control against any 
confounding effects of task-specific variables. One paradigm is the counting 
test developed by Liberman and her colleagues, a test used in several studies 
of phonological awareness among American children (see, for example, Liberman 
et al., 1974; Mann & Liberman, 1984). The other is a deletion task, much like 
that employed by Morais et al. (1979) and Read et al. (1984) in their studies 
of alphabet-illiterate adults. 

Experiment I used the counting test paradigm to study Japanese 
first-graders who had recently mastered the Kana syllabaries. To clarify the 
impact of knowledge of a syllabary vs. an alphabet, the results are compared 
with those reported in Liberman et al.'s (1974) study of American first 
graders. The relation between reading and phonological awareness is also 
probed by an analysis of the relation between phoneme and syllable counting 
performance and the ability to read Hiragana, in which case a nonlinguistic 
counting test guards against the possibility that any correlations might 
reflect attention capacity, general intelligence, etc. To further clarify the 
role of knowledge of the alphabet, Experiment II extended use of the counting 
test paradigm to Japanese children in the third to sixth grades. In Japan, 
children routinely receive some instruction in alphabetic transcription 
(Romaji) at the end of the fourth grade. There also exist certain "re-entry" 
programs for fourth through sixth graders who have spent the first few years 
of their education abroad and who have learned to read an alphabetic 
orthography. Comparisons among the re-entering pupils and normal pupils at 
various grade levels clarify the relative contribution of alphabetic 
knowledge vs. knowledge of Kana and Kanji. 

Experiment III used the deletion test paradigm to replicate and extend 
the findings of Experiment I. Aside from the change in procedure, its major 
innovation was to employ nonsense words as stimuli, constructing them in a 
fashion to permit parallel testing of first graders in Japan and in America. 
Analysis of the results concerns performance on each deletion test in relation 
to reading experience and reading ability. Finally, Experiment IV used the 
same paradigm in a partial replication of Experiment II, comparing Japanese 
fourth graders who had not received instruction in Romaji with sixth graders 
who had been taught about Romaji one and a half years prior to the test 
session. 

Experiment I 
Methods 

Subjects 

The subjects were 40 children attending the first grade of the primary 
school attached to Ochanomizu University, twenty girls and twenty boys chosen 
at random from the available population and serving with the permission of 
their parents and teachers. Mean age at the time of testing, which was the 
beginning of the second trimester of the school year, was approximately 84 
months (see Table I). As a 
measure of Hiragana reading ability, each child rapidly read aloud a list of 
thirty high-frequency nouns, adjectives, and verbs (Sasanuma, 1978), and the 
total reading time and the number of errors were recorded. Each child was 
also rated by his or her teacher as above-average, average, or below-average 
in Kana reading ability. 

Materials 

The experiment employed three sets of materials designed to measure the 
ability to count three types of items: mora, phonemes, and 30° angles (a 
nonlinguistic unit). All three sets were modeled after the materials of 
Liberman et al. (1974): Each contained four series of training items that 
offered the child an opportunity to deduce the nature of the unit being 
counted, followed by a sequence of test items. In the mora counting test and 
phoneme counting test, all training and test items were common Japanese words 
that had been judged by four informants (a linguist, a speech scientist, a 
teacher of Japanese, and a librarian) to be readily familiar to young 
children. In the angle counting test, the items were simple line drawings of 
abstract designs and common objects. A more complete description of each test 
follows. 



Mora counting test. Mora are rhythmic units of the Japanese language 
that more-or-less correspond to syllables. Each mora is either an isolated 
vowel, a vowel preceded by a consonant, an isolated [n], or the first 
consonant in a geminate cluster. A basic difference between mora and English 
syllables is that mora cannot contain consonant clusters, in general, or 
consonants in final position. It is further the case that a single syllable 
of English may correspond to two mora of Japanese. This owes to the fact 
that, in a Japanese word such as hon, [n] can be a mora, whereas [n] cannot be 
a syllable of English, and to the fact that differences in vowel duration (one 
or two mora) and consonant closure duration (normal or an extra mora) 
distinguish minimal pairs of Japanese words but are not contrastive in 
English. 

In the mora-counting test, each training series contained three words: 
two-, three- and four-mora in length. Within the first three series, the 
words formed a progressive sequence, as in hito (man), hitotsu (one), 
hitotsubu (a grain or drop), but the words of the fourth series bore no such 
relation to each other [i.e., ima (now), kitte (stamp), chiisai (small)]. To 
introduce some of the complexities of Japanese phonology, the third series 
included a devoiced vowel, and the fourth included a long vowel and a geminate 
consonant. To avoid biasing the child's decision as to whether the task was 
to count the mora in a word (a phonological strategy) or the number of Kana 
characters needed to spell the word (a spelling strategy), the training items 
included only those mora that are spelled with a single character. Thus it 
was left ambiguous whether the task was to count orthographic units, or 
phonological ones. 

The test sequence consisted of 14 two-mora words, 14 three-mora words, 
and 14 four-mora words presented in a fixed random order. They represented 
common combinations of mora including the nasal mora, geminate vowels, 
geminate consonants, and devoiced vowels. There were four VV words, two CVV 
words, six CVCV words and two CVC words in the two-mora pool; two VCVV words, 
two VVCV words, two CVVCV words, three CVCVCV words, two CVCVC words, two 
CVCCV words, and one CVCVV word in the three-mora pool; and four VCVCVCV 
words, two VCCVCV words, one VCVCCV word, four CVCVCVCV words, two CVCVCVV 
words, and one CVCCVCV word in the four-mora pool. As a probe for whether 
children were counting mora or orthographic units, three of the test items 
included one of the Japanese mora spelled with two characters. 
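
The mora counts implied by the C/V skeletons listed above can be checked mechanically. The following Python sketch is an illustration added for that purpose and is not part of the original test materials. It assumes that every V is the nucleus of exactly one mora, and that any C not immediately followed by a V is either a moraic nasal or the first consonant of a geminate, which is how the definition of mora given earlier reads. The function names are hypothetical.

```python
def count_mora(skeleton: str) -> int:
    """Count mora in a consonant/vowel skeleton such as "CVCCV".

    Assumption: each V is one mora (alone or with a preceding C), and a C
    that is not directly followed by a V counts as a mora of its own
    (moraic nasal or first half of a geminate).
    """
    mora = 0
    for i, symbol in enumerate(skeleton):
        if symbol == "V":
            mora += 1
        elif symbol == "C":
            followed_by_vowel = i + 1 < len(skeleton) and skeleton[i + 1] == "V"
            if not followed_by_vowel:
                mora += 1
        else:
            raise ValueError(f"unexpected symbol {symbol!r} in {skeleton!r}")
    return mora


def count_segments(skeleton: str) -> int:
    """Number of C and V segments, i.e. the phoneme count of the skeleton."""
    return len(skeleton)


# Skeleton pools as listed for the mora-counting test sequence.
TWO_MORA = ["VV", "CVV", "CVCV", "CVC"]
THREE_MORA = ["VCVV", "VVCV", "CVVCV", "CVCVCV", "CVCVC", "CVCCV", "CVCVV"]
FOUR_MORA = ["VCVCVCV", "VCCVCV", "VCVCCV", "CVCVCVCV", "CVCVCVV", "CVCCVCV"]

for expected, pool in ((2, TWO_MORA), (3, THREE_MORA), (4, FOUR_MORA)):
    for skeleton in pool:
        assert count_mora(skeleton) == expected, skeleton
```

Under these assumptions every skeleton falls in the pool the text assigns it to, while the number of segments varies within a pool (VV and CVCV are both two mora but contain two and four segments, respectively).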



Phoneme counting test. The design was analogous to that for the 
mora-counting test, but items manipulated the number of phonemes instead of 
the number of mora. The four training series contained a variety of the 
possible two-, three-, and four-phoneme sequences of Japanese, including nasal 
mora, devoiced vowels, long vowels and geminate consonants. Each of the first 
three contained a progressive sequence of items [i.e., ho (sail), hon (book), 
hone (bone)], whereas the fourth did not [i.e., ta (field), kau (buy), shita 
(under)]. The test sequence contained 14 two-phoneme words, 14 three-phoneme 
words, and 14 four-phoneme words arranged into a fixed random order. They 
comprised a broad sample of the permissible phoneme sequences in Japanese, 
including nasal mora, geminate consonants and vowels and devoiced vowels, 
which avoided systematic relationships between the number of phonemes a word 
contained, and either the number of mora in that word, or the number of Kana 
needed to spell it. There were four VV words, eight CV words, and two VC 
words in the two-phoneme pool; two VVV words, four VCV words, four CVV words, 
and four CVC words in the three-phoneme pool, and six CVCV words, two CVVV 
words, two VCCV words, two VCVV words, and two VVCV words in the four-phoneme 
pool. 



Angle counting test. The materials were simple black and white line 
drawings that appeared on three by five inch cards. From one to three 30° 
angles were embedded in each drawing and the task was to count the number of 
these angles. In keeping with the design of the phoneme- and mora-counting 
tests, there were four series of training trials; in the first three series, 
the items were a progressive set of simple geometric shapes, but in the fourth 
they were objects that bore no systematic relationship to each other. The 
test sequence comprised drawings of objects, seven with one angle, seven with 
two angles and seven with three angles, arranged in a fixed random sequence. 

Procedure 

Prior to testing, the children were divided into two groups of ten girls 
and ten boys each. One group received the mora counting test, the other 
received the phoneme-counting test, and both received the angle-counting test 
at the onset of the session and the reading test at the end. The procedure 
for all three counting tests was the same. The instructor (a native speaker 
of Japanese) took two small hammers and told the child that they would be 
playing a "counting game." He then demonstrated the first training series in 
progressive order by saying each word in a normal fashion (or displaying each 
card) and then tapping the number of mora, phonemes, or angles. Next, the 
demonstration was repeated, with the child copying the instructor (saying each 
word first), and then items in the series were presented in a fixed random 
order, and the child responded without benefit of demonstration. If an error 
was made, the item was repeated and presentation of another randomized series 
followed. Otherwise, training proceeded to the next series, until, on 
completion of the fourth training series, the test items were presented and 
the child was instructed to "count" each item without the benefit of response 
feedback. 

Results and Discussion 

In evaluating children's responses on the mora and phoneme counting 
materials, two different scores were computed: the number of correct 
responses (as in Mann & Liberman, 1984), and a pass/fail score in which the 
criterion for passing was six consecutive correct responses (as in Liberman et 
al., 1974). Both appear in Table I along with mean age and mean reading 
scores for children in each group. The children who counted mora were 
equivalent to those who counted phonemes in terms of mean age, measures of 
reading ability, and performance on the angle-counting test (p > .05). However, 
whereas scores on the mora-counting test approached ceiling, scores on the 
phoneme-counting test were considerably lower, t(38) = 20.20, p < .0001. In 
addition, all of the children had passed the mora counting test, whereas only 
10% had passed the phoneme counting test. The percentage of Japanese children 
who passed each test can be compared with the percentage of American first 
graders who had passed comparable tests in Liberman et al.'s original study: 
90% for syllable counting, and 70% for phoneme counting. Apparently, 
first-grade children who have been educated in the use of the alphabet tend to 
perform better on the phoneme counting test than those who have not. 
Moreover, while children who have been educated in a syllabary might do 
slightly better on the syllable counting test, any difference is less 
dramatic. At present, no strong conclusion can be reached about these 
differences and their implications: Different test materials were used in the 
two countries, and children were not told explicitly to focus on the spoken 
word as opposed to its orthographic representation. Both problems are 
surmounted in Experiment III, which employed 1) a common set of materials in 
the testing of Japanese and American first graders, and 2) instructions to 
manipulate the sound pattern of each item. 
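
The two scores described at the start of this section (the number of correct responses, and a pass/fail score with a criterion of six consecutive correct responses) can be sketched as follows. This is an illustrative Python sketch rather than the scoring procedure actually used; the function names and the example response sequence are hypothetical.

```python
from typing import Sequence


def number_correct(responses: Sequence[int], targets: Sequence[int]) -> int:
    """Count the items on which the tapped number equals the target number."""
    return sum(r == t for r, t in zip(responses, targets))


def passes_criterion(
    responses: Sequence[int], targets: Sequence[int], run_length: int = 6
) -> bool:
    """Return True if any run of `run_length` consecutive items is correct."""
    streak = 0
    for r, t in zip(responses, targets):
        streak = streak + 1 if r == t else 0
        if streak >= run_length:
            return True
    return False


# Hypothetical eight-item example: one error on the sixth item.
targets = [2, 3, 4, 2, 3, 4, 2, 3]
responses = [2, 3, 4, 2, 3, 3, 2, 3]

print(number_correct(responses, targets))    # 7 of 8 items correct
print(passes_criterion(responses, targets))  # longest run is 5, so False
```

The example shows why the two scores can diverge: a child may answer most items correctly and still fail the criterion if errors interrupt every run of six.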

Performance on each test gave indications of the influence of knowledge of 
Kana. In the mora-counting test, children appeared to deduce that the task 
involved counting orthographic units rather than counting phonological units. 
The majority gave an extra "tap" to the three items that contained a mora 
spelled with two characters instead of one, as if they were counting the 
number of characters needed to spell the word, instead of the number of mora. 
Other, much less frequent, errors on this test involved words that contained 
geminate consonants or long vowels, both of which tended to be underestimated 
and were missed only by the poorest readers of the group. 



Table I 

The Ability of Japanese First Graders to Count Mora vs. Phonemes 

                                             SUBJECT GROUP 
                                    MORA COUNTING   PHONEME COUNTING 

Phonological Counting 
  Mean No. Correct (Max. = 42)          38.1             18.1 
  Percentage Passing                   100.0             10.0 

Angle Counting 
  Mean No. Correct (Max. = 21)          11.9             11.8 

Kana Reading Ability 
  Mean speed (in sec.)                  61.1             60.7 
  Mean errors (Max. = 30)                1.6              1.8 
  Mean teacher rating                    1.9              2.0 
    (Good = 1, avg. = 2, poor = 3) 

Mean age (in months)                    83.7             84.1 

Analogous adherence to a "spelling strategy" can be found in children's 
responses to the phoneme-counting materials. During a post-hoc interview, 
some of the children reported that they had tapped the number of Kana 
characters needed to spell a given word, and then added one to arrive at the 
correct response. Use of a "Kana-plus-one" strategy could not allow children 
to reach the criterion of six consecutive correct responses, but it certainly 
inflated the number of correct responses. Items (N = 25) for which the 
"Kana-plus-one" strategy yielded the appropriate response were correctly 
counted by an average of 55% of the children (which is significantly better 
than chance, t(24) = 2.62, p < .05). In contrast, only an average of 38% had been 
correct on each item (N = 17) for which that strategy yielded the incorrect 
response (which is significantly less than the percentage of children giving 
correct responses to the strategy-appropriate items, t(40) = 5.4, p < .001, and 
not significantly better than chance, p > .05). 
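
The division of items into those for which the "Kana-plus-one" strategy succeeds and those for which it fails can be illustrated with the six phoneme-counting training words mentioned earlier. The Python sketch below is illustrative only. The phoneme counts follow the two-, three-, and four-phoneme series in which the text places each word; the Kana counts are an added assumption based on the usual Hiragana spelling of each word, and the names used are hypothetical.

```python
from typing import Dict, List, Tuple

# word -> (phoneme count as treated in the text, assumed number of Kana)
TRAINING_ITEMS: Dict[str, Tuple[int, int]] = {
    "ho": (2, 1),
    "hon": (3, 2),
    "hone": (4, 2),
    "ta": (2, 1),
    "kau": (3, 2),
    "shita": (4, 2),
}


def kana_plus_one(kana_count: int) -> int:
    """Response produced by tapping once per Kana character, plus one."""
    return kana_count + 1


def split_by_strategy(
    items: Dict[str, Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Separate items the strategy gets right from items it gets wrong."""
    appropriate: List[str] = []
    inappropriate: List[str] = []
    for word, (phonemes, kana) in items.items():
        if kana_plus_one(kana) == phonemes:
            appropriate.append(word)
        else:
            inappropriate.append(word)
    return appropriate, inappropriate


appropriate, inappropriate = split_by_strategy(TRAINING_ITEMS)
print(appropriate)    # words where Kana + 1 equals the phoneme count
print(inappropriate)  # words where the strategy gives the wrong count
```

Because the strategy fails whenever a word contains more than one consonant-vowel mora, a child relying on it alone would be unlikely to produce six consecutive correct responses on a mixed sequence.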

A final concern of this experiment was the relation between performance on 
each counting test and the ability to read Kana. For the children who learned 
to count mora, the number of correct responses on the mora counting test was 
significantly related to teacher ratings, r(20) = .72, p < .0001, Hiragana 
reading speed, r(20) = .58, p < .003, and the number of errors, r(20) = -.47, 
p < .02, but not to age, sex, or performance on the angle counting test. This 
is consistent with Amano's (1970) report that mora counting ability is related 
to the acquisition of the first few Kana characters by pre-school children, 
and extends his finding to children in the first grade who possess 
considerably greater knowledge of the Kana syllabary. For the children who 
learned to count phonemes, the number of correct responses on the phoneme 
counting test was also significantly related to teacher ratings, r(20) = .56, 
p < .005, reading speed, r(20) = .65, p < .001, and reading errors, r(20) = -.57, 
p < .004, but not to age, sex, or angle counting performance. 

Thus it would appear that performance on the phoneme counting test is 
related to the ability to read Kana even though Kana does not represent 
phonemes in any direct way. As both phoneme and syllable counting performance 
are related to the ability to read Hiragana, just as they are related to the 
ability to read an alphabet, it is tempting to posit a general capacity for 
phonological awareness that is related to experience in reading any 
phonologically-based orthography. This capacity need not be part of general 
intelligence, given the results of some recent studies of American children 
(Mann & Liberman, 1984; Stanovich et al., 1984b), and the present finding that 
there is no significant correlation between measures of reading ability and 
performance on the angle counting test. It could be a general product of 
learning to read a phonological orthography rather than the cause of reading 
success, commensurate with children's reliance on Kana-based strategies. We 
will return to these issues in the final discussion. 

The results of Experiment I are consistent with previous reports that 
awareness of phonemes depends on the experience of learning to read an 
alphabet, insofar as the majority of children could not pass the phoneme 
counting test. Nonetheless, two of the Japanese children did pass the test 
and our post-hoc interviews with them indicated that they had received no 
instruction in the alphabet either at home, school, or "juku" (i.e., afternoon 
training programs). Thus, while there may be some facilitating effects of 
learning a syllabary on awareness of both phonemes and syllables, some other 
factors may lead to individual variations. As a further test of the view that 
awareness of phonemes depends on the experience of learning to read an 
alphabet, we now turn to Experiment II, which focused on the phoneme counting 
ability of Japanese children in the third through sixth grades, comparing 
children at different grade levels in normal and "re-entering" classrooms. 

Experiment II 

Method 

Subjects 

The subjects were children attending the normal third- through sixth-grade 
classes and the special "re-entry" class at Ochanomizu University. The 
"normal class" subjects included 64 children in the third and fourth grades, 
and 32 children in the fifth and sixth grades. The "re-entry class" subjects 
included 13 fourth graders, 14 fifth graders, and 12 sixth graders, all of 
whom had learned to read either the English or German alphabet. Approximately 
equal numbers of boys and girls were included in each group and all served 
with parental permission. They were tested during the second trimester of 
school, so that children in the normal fourth-grade classes had not yet 
received training in the alphabet. Consultation with the teachers, the 
principal, and the children themselves confirmed that none of the subjects in 
normal classrooms had received instruction in the alphabet at school, home or 
"juku". 

Materials and Procedure 

The materials were the mora- and phoneme-counting materials employed in 
Experiment I, administered by the same instructor. For convenience, the 
procedure was adapted for group testing, in which case an entire class of 
children received the basic instructions and practice items with feedback, and 
learned to "count" each word by drawing slashes through the appropriate number 
of boxes in a five-box answer grid instead of by tapping the number of 
syllables/phonemes with a hammer. As in Experiment I, feedback was provided 
during training, but no feedback was provided during presentation of the test 
items. To insure the feasibility of group testing, the mora-counting 
materials were administered as a control measure to 32 of the third graders 
and 32 of the fourth graders. All of the remaining subjects received the 
phoneme-counting materials. 

Results and Discussion 

The data were scored in the manner of Experiment I, by computing both the 
number of correct responses and a pass/fail score. The results obtained from 
the mora-counting materials indicate the utility of the group testing 
procedure, as all of the third- and fourth-grade children had passed criterion 
with mean scores of 38.7 and 39.0, respectively. They also attest to the 
continuing power of the Kana orthography to mold the Japanese child's concept 
of language: As was the case in Experiment I, almost all of the children had 
made errors on the three test words in which the number of kana characters 
needed to spell the word surpasses the number of mora it contains. 

Performance on the phoneme counting test is summarized in Table II, 
according to the age of the subjects, and whether they were in the normal or 
re-entry classes. On the basis of previous findings that alphabet-illiterate 
adults are not aware of phonemes, it might be expected that normal Japanese 
third and fourth graders would be no more aware of phonemes than the Japanese 
first graders studied in Experiment I, whereas the normal fifth and sixth 
graders and all of the re-entry students would be comparable to the American 
first graders studied by Liberman et al. (1974). Yet, the data fail to uphold 
that prediction. First, for children in the normal classrooms, whose data 
appear in the upper portion of Table II, the only marked improvement in 
phoneme counting scores occurs between the third and fourth grades, prior to 
any instruction in the alphabetic principle. There is also no sharp spurt in 
the awareness of phonemes between fourth and fifth grades (p>.05), such as 
would be expected if instruction in the alphabet were critical. Second, 
fourth graders in the re-entry group performed at the same level as their peers 
in the normal classrooms (p>.05), despite the fact that they alone had learned 
to read an alphabet. Third, and finally, the proportion of Japanese fourth 
graders who had passed criterion is comparable to that among the American 
children in Liberman et al.'s (1974) study, despite the fact that the Japanese 
children had not yet learned to read the Romaji alphabet. 

Table II 

Phoneme Counting Ability Among Japanese Children in the 
Third to Sixth Grades: Normal vs. Reentering Students 

                                        Grade 
                          Third    Fourth    Fifth    Sixth 

Normal students 
  Mean No. Correct         21.5     30.3     31.2     31.5 
    (Max. = 42) 
  Percentage Passing       56.2     73.5     81.3     75.0 
  Age (in months)         108.5    120.1    131.2    143.7 

Reentering students 
  Mean No. Correct                  27.2     28.6     27.7 
    (Max. = 42) 
  Percentage Passing                60.0     60.0     80.0 
  Age (in months)                  118.9    132.7 


As in Experiment I, the importance of orthographic knowledge is illustrated 
by the pattern of errors, which suggests that at least some children were 
relying on the "Kana-plus-one" strategy of counting the number of characters 
needed to spell the word, and then adding one. Children at all ages tended to 
be most successful on items for which this strategy yielded the correct 
response: for strategy-appropriate items the average percent correct was 58%, 
80%, 81%, and 82%, for third through sixth graders, respectively, whereas that 
for the strategy-inappropriate items was 42%, 56%, 64%, and 67%, respectively. 
Here, however, performance on both types of items surpassed the chance level 
of 33% correct (p<.05), suggesting that appreciably many children at each age 
had been counting phonemes. 
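
The distinction between strategy-appropriate and strategy-inappropriate items can be made concrete with a minimal sketch. It assumes that each item is described by the list of kana characters needed to spell it and by its phoneme sequence; the two romanized words below are hypothetical illustrations and are not drawn from the test materials, and the function names are likewise assumptions.

```python
def kana_plus_one(kana_chars):
    """Count predicted by the 'Kana-plus-one' strategy: characters plus one."""
    return len(kana_chars) + 1


def is_strategy_appropriate(kana_chars, phonemes):
    """True when the strategy happens to yield the correct phoneme count."""
    return kana_plus_one(kana_chars) == len(phonemes)


# Hypothetical items (romanized): word -> (kana characters, phonemes).
ITEMS = {
    "aki": (["a", "ki"], ["a", "k", "i"]),          # 2 kana, 3 phonemes
    "neko": (["ne", "ko"], ["n", "e", "k", "o"]),   # 2 kana, 4 phonemes
}

for word, (kana, phonemes) in ITEMS.items():
    print(word, kana_plus_one(kana), len(phonemes),
          is_strategy_appropriate(kana, phonemes))
```

Under these assumptions a child who relies on the strategy would answer correctly on the first item and incorrectly on the second, which is the pattern the error analysis looks for.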

A popular organization of the Kana syllabary places the characters in a 
grid with the vowel mora in a different column to the far right of those 
containing characters for other mora. This organization had led us to 
anticipate that some of the subjects in Experiments I and II would use a 
strategy of giving the vowel mora one count and all other mora two counts. 
However, in post-hoc interviews of our subjects we found that none of them 
described such a strategy. Likewise, none of the children reported special 
treatment of the kana that can receive diacritics to mark the voicing of an 
initial stop consonant or fricative. Certainly it is possible that knowledge 
of Kana may have in some other way provoked children to reflect on the 
internal structure of words and thereby promoted phoneme awareness, but we 
were unable to determine why. Although children master Kana by very early 
stages of first grade, the sharpest increase in phoneme counting performance 
occurs between third and fourth grade. Either increased experience of a very 
general sort or some maturational factors could be responsible. 



In summary, although the findings of Experiment I suggest that both phoneme 
and syllable counting ability in the first grade might be facilitated by 
knowledge of an orthography that transcribes language at the level of that 
unit, the findings of Experiment II suggest that, analogous to the many 
American children who become aware of syllables by age six without having 
learned to read a syllabary, many Japanese children may become able to count 
phonemes by age nine or ten, despite a lack of formal instruction in the 
alphabet. Moreover, at that age, training in the use of an alphabet does not 
particularly enhance the ability to count phonemes. This finding stands in 
contrast to findings that most alphabet-illiterate adults appear to lack an 
awareness about phonemes. 

One possible explanation of the performance differences between alphabet- 
illiterate adults and Japanese children is that they reflect task differences 
rather than differences in phonological awareness, per se. Japanese children 
might appear to be more aware of phonemes because the counting tasks employed 
in Experiments I and II were not explicit as to whether "sounds" or characters 
were to be counted, leading to reliance on a Kana-based strategy that inflated 
the number of correct responses. However, use of such a strategy could not 
account for changes in the percentage of children who passed the phoneme 
counting test, which raises the possibility that children passed the test 
because it provided a less conservative measure of phoneme awareness than the 
deletion tasks used in studies of adults. The results of at least one study 
are commensurate with this latter possibility. Performance on counting tasks 
and deletion tasks emerged as separate factors in a study of the relation 
between phonological awareness and the reading progress of semi-literate 
adults enrolled in a remedial reading class (Read & Ruyter, 1985). Another 
study, however, reveals that task differences are not of critical importance 
to the relation between phonological awareness and the future reading success 
of kindergarten children in America (Stanovich et al., 1984a). However, as 
this latter study did not include counting tests, it remains a possibility 
that performance on counting tasks involves a more accessible level of 
phonological awareness than performance on deletion tests, hence the 
apparently greater awareness of phonemes on the part of Japanese children 
relative to alphabet-illiterate adults. 

If the above explanation is correct, the present findings should not extend 
to use of a deletion test. On such a test, Japanese children should behave 
as poorly as alphabet-illiterate adults. With this prediction in mind, we 
turn to Experiments III and IV, which attempted to replicate Experiments I and 
II with deletion tasks analogous to those employed by Morais et al. (1979) and 
by Read et al. (1984). Two sets of nonsense-word materials were designed, one 
for phoneme deletion and one for mora deletion. Nonsense words had been among 
the most difficult items for the adult subjects and therefore offer a 
maximally conservative measure of children's performance; they also permit 
parallel testing of Japanese and American children. 

Experiment III 

Method 

Subjects 

The subjects were 40 Japanese first graders and 40 American first graders. 
There were equal numbers of girls and boys, all of whom served with parental 
and teacher permission. The Japanese children were drawn from an available 
population of children who had not participated in Experiment I. Mean age was 
84.4 months at the time of testing, which was midway through the second 
trimester of the school year. The American children were comparable in age 
and SES, and were attending the Bolles Primary School in Jacksonville, FL. 
Mean age was 84.1 months at the time of testing, which was early in the second 
semester of the school year. Measures of children's reading ability were 
obtained by having the teachers rate each child as good, average, or poor in 
reading ability, and by giving each child a test of word decoding skill: the 
Hiragana reading test described in Experiment I for Japanese children, and the 
Word Identification and Word Attack Subtests of the Woodcock Reading Mastery 
Test (Woodcock, 1973) for American children. 

Materials 

As in Experiment I, two parallel sets of materials were designed, one for 
assessing syllable deletion ability and one for assessing phoneme deletion 
ability. The design of each was prompted by the methodology of Morais et al. 
(1979) and Read et al. (1984): Each set of materials assessed deletion of two 
different tokens of the segment of interest, with blocked sequences of 
training items followed by test items. To make the items suitable for use in 
English and Japanese, it was necessary that they contain only those Japanese 
mora that bear a one-to-one relationship to English syllables. Thus, all 
items contained consonants and vowels shared by the two languages, and none of 
them contained long vowels, syllabic [n], geminate consonants, diphthongs, 
consonant clusters, or syllable-final consonants. Each test item, and the 
item formed by removing its initial mora (or phoneme, as appropriate), was 
judged to be meaningless in Japanese (by the informants who judged the items 
of Experiment I) and in English (by comparable English-speaking informants). 



Syllable materials. These materials assessed children's ability to remove 
an initial syllable (mora), [ta] or [u], from a three-syllable/three-mora 
nonsense word. Twenty items started with [ta] and twenty with [u]; the second 
and third syllable of each word varied freely. For the purpose of testing, 
the items were blocked with respect to initial syllable, and each block was 
subdivided into ten practice items and ten test items. 



Phoneme materials. These materials assessed children's ability to remove 
an initial phoneme, [ʃ] or [k], from a four- or six-phoneme (i.e., two or 
three syllables/mora) nonsense word. Twenty items started with [ʃ] and twenty 
with [k]. The second phoneme of each word was always one of the five 
permissible vowels such that, across the items, each initial phoneme was 
followed by each vowel once in a four-phoneme word, and once in a six-phoneme 
word, with the remaining portion of each item varied freely. For the purpose 
of testing, the items were blocked with respect to initial phoneme, and each 
block was divided into ten practice items and ten test items (such that two- 
and three-syllable words were equally divided between practice and test items, 
as were the five vowels that could occur in the second-phoneme position). 
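
The balance of the phoneme materials can be checked with a short enumeration. This is only a sketch under one reading of the description, namely that each initial phoneme, vowel, and word length is crossed once within the practice items and once within the test items; the romanized labels and the `count` helper are assumptions introduced for illustration.

```python
from itertools import product

INITIALS = ["sh", "k"]               # romanized stand-ins for the two initial phonemes
VOWELS = ["a", "i", "u", "e", "o"]   # the five vowels in second-phoneme position
LENGTHS = [4, 6]                     # phonemes per item (two or three mora)
PHASES = ["practice", "test"]

# One design cell per combination; the rest of each item varied freely.
cells = [
    {"initial": c, "vowel": v, "n_phonemes": n, "phase": ph}
    for c, v, n, ph in product(INITIALS, VOWELS, LENGTHS, PHASES)
]


def count(**conditions):
    """Number of design cells that match every given attribute."""
    return sum(
        all(cell[key] == value for key, value in conditions.items())
        for cell in cells
    )


print(len(cells))                          # 40 items in all
print(count(initial="k"))                  # 20 items per initial phoneme
print(count(initial="k", phase="test"))    # 10 test items per block
```

On this reading the crossing reproduces the stated totals of twenty items per initial phoneme and ten practice plus ten test items per block.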



Procedure 

Children were tested individually by native speakers who used comparable 
instructions in the two languages. Within each country, half of the children 
received the syllable deletion test, half received the phoneme deletion test, 
and all received the reading test at the conclusion of the session. For each 
deletion test, presentation of practice and test trials was blocked with 
respect to initial segment (i.e., [ta] or [u], [ʃ] or [k]) with order 
counterbalanced across subjects. The instructor explained that the task 
involved repeating a word and then trying to say it without the first sound. 
He or she then proceeded to demonstrate the first five practice items: saying 
each word, repeating it, and then saying it without the first syllable or 
phoneme. Next, each of these was repeated and the child was requested to 
imitate the instructor by repeating the item and then saying it "without the 
first sound." Then the final five practice items were administered without 
benefit of demonstration, but with response feedback. Completion of the 
practice items was followed by the ten test items, which were administered 
without response feedback. Completion of the first block of trials was 
followed immediately by presentation of the second block of training and test 
items. 

Results and Discussion 

Attempts to remove the initial segment from each item were scored as 
correct or incorrect. The mean number of correct responses appears in Table 
III, separately for the American and Japanese children, according to the type 
and token of the segment being manipulated. When averaged across tasks and 
tokens, the scores of American children are slightly superior, F(1,76)=7.31, 
p<.009. With regard to the type of segment being deleted, children in both 
countries found the phoneme deletion task more difficult than the syllable 
(mora) deletion one, F(1,76)=87.64, p<.0001. However, the extent of 
difference between scores on the two tasks was greater for the Japanese 
children, F(1,76)=13.01, p<.0006. As compared to the American children, the 
Japanese children received higher scores on the syllable deletion task, 
t(38)=2.73, p<.05, but lower scores on the phoneme deletion task, t(38)=4.09, 
p<.01. There were no significant effects of token differences, nor 
interactions between this manipulation and other factors. 

Table III 

Mora (Syllable) Elision Ability vs. Phoneme Ability: 
A Comparison of First Graders in Japan and America 

                                   Mora Elision      Phoneme Elision 
                                   [u]      [ta]      [ʃ]      [k] 

Japanese Children 
  Mora Group 
    Mean No. Correct               9.15     9.55 
    (Max. = 10, Age = 83.8 mo.) 
  Phoneme Group 
    Mean No. Correct                                  1.75     3.10 
    (Max. = 10, Age = 85.1 mo.) 

American Children 
  Syllable Group 
    Mean No. Correct               8.90     8.80 
    (Max. = 10, Age = 83.5 mo.) 
  Phoneme Group 
    Mean No. Correct                                  5.72     5.61 
    (Max. = 10, Age = 84.8 mo.) 

A further analysis considered the relations between phoneme and syllable 
deletion performance (summed across tokens) and reading ability in each 
country. As anticipated by the results of Experiment I, the mora deletion 
performance of the Japanese children was related to the speed, r(20)=.69, 
p<.001, and number of errors made on the Hiragana test, r(20)=.72, p<.001, and 
also to the teacher's ratings of reading ability, r(20)=.54, p<.005. 
Likewise, their phoneme deletion ability also proved to be related to speed, 
r(20)=.37, p<.05, and errors on the Hiragana test, r(20)=.38, p<.05, and to 
teacher ratings, r(20)=.47, p<.02. For the American children, phoneme 
deletion ability was related to the sum of raw scores on the Woodcock tests, 
r(20)=.61, p<.005, and to the teacher's ratings, r(20)=.57, p<.008, but 
syllable deletion ability was not related to either measure of reading 
ability. In neither language community was the age or sex of the first 
graders related to reading ability, mora deletion ability, or phoneme deletion 
ability (p>.1). 

The relative superiority of the American children in the case of the 
phoneme deletion task corroborates previous indications that awareness about 
phonemes is facilitated by the learning of an alphabetic orthography. The 
analogous finding that Japanese children perform at a superior level on the 
syllable deletion task suggests that awareness about syllables may be likewise 
facilitated by learning to read a syllabary. Nonetheless, the finding that 
both Japanese and American children achieved higher levels of performance on 
the syllable deletion test than on the phoneme deletion test suggests that the 
ability to read a syllabary is less critical to awareness about syllables than 
the ability to read an alphabet is to awareness about phonemes. We now turn 
to Experiment IV, which attempted to replicate the findings of Experiment II 
regarding the contribution of orthographic knowledge to the phoneme deletion 
performance of Japanese children in normal fourth- and sixth-grade classrooms. 

Experiment IV 

Method 

Subjects 

The subjects were 20 fourth graders and 20 sixth graders attending the 
normal classes of the Ochanomizu Elementary School. Ten boys and ten girls 
from each grade were chosen at random from among the available pool of 
children who had not participated in Experiment II (i.e., those whose only 
experience with alphabetic instruction had occurred in school). All served 
with teacher and parental permission. Testing was conducted during the first 
trimester of the school year such that only the sixth graders had been 
educated in the use of an alphabetic orthography. Mean ages for each group 
were 117.1 and 142.5 months, respectively. 

Materials and Procedure 

The materials and procedure for Experiment IV were those of the phoneme 
deletion test employed in Experiment III. The only innovation was that, at the 
completion of the test session, each subject was given two of the test items 
to which he or she had responded correctly and was asked to explain how the 
correct response had been derived. This provided a test of whether subjects 
had relied on either a Kana-based or a Romaji-based spelling strategy. 

Results 

The mean number of correct responses appears in Table IV, separated 
according to grade level and the phoneme token ([ʃ] or [k]) being manipulated. 
It can be seen that the performance of the sixth graders surpassed that of the 
fourth graders, F(1,38)=18.49, p<.0001, consistent with the fact that only 
the sixth graders had learned to use alphabetic transcription. When the 
present results were compared with those obtained in Experiment III (and shown 
in Table III), it was found that both the Japanese fourth and sixth graders 
had surpassed the Japanese first graders in mean performance on the phoneme 
deletion task, t(38)=4.08, p<.01 for fourth graders, and t(38)=4.53, p<.01 for 
sixth graders. The Japanese fourth graders performed at the same level as the 
American first graders (p>.1), and the Japanese sixth graders had actually 
surpassed them, t(38)=5.11, p<.01. 



Table IV 

Phoneme Elision Performance Among Older Japanese Children 

                                    Phoneme Elision 
Grade in School                     [ʃ]        [k] 

Fourth Grade 
  Mean No. Correct                  4.82       7.55 
  (Max. = 10, Age = 117.1 mo.) 
Sixth Grade 
  Mean No. Correct                  8.33      10.00 
  (Max. = 10, Age = 142.5 mo.) 



To gain some appreciation of the Japanese children's knowledge of Romaji, 
we conducted an informal post-hoc interview with the five children who 
performed the best at each grade level. We found that none of the 
fourth-graders could read the nonsense test materials written in Romaji, 
whereas three of the sixth graders could do so. In contrast, although we had 
not asked the American children to try to read the test materials, they had 
been able to read an appreciable number of nonsense words on the Woodcock 
word-attack test. It may be remembered that the Japanese fourth graders had 
not received any instruction in Romaji, whereas the sixth graders had received 
approximately four weeks of instruction a full year and a half prior to the 
test session. The American first graders, on the other hand, had been 
receiving intensive phonics-based instruction in the use of the English 
alphabet for more than six months immediately prior to the test session. 

A further analysis reveals an effect of token variations: Both fourth and 
sixth graders tended to give more correct responses to items that began with 
[k] than to those that began with [ʃ], F(1,36)=20.73, p<.0001. This may be 
explained by hypothesizing a "character-substitution" strategy based on the 
previously mentioned grid for representing the Kana syllabary as a matrix of 
rows and columns in which mora that share a vowel lie in the same row, and 
those that share a consonant lie in the same column. Within the matrix, the 
character for [a] is to the immediate right of that for [ka], [i] is to the 
immediate right of [ki], [u] to [ku], etc. Thus, children might be tempted to 
spell a word by replacing the first character with the character that lies to 
its immediate right on the matrix. Use of this strategy could cause [k] to be 
easier to delete than [ʃ] because characters containing [k] are immediately 
adjacent to those for isolated vowels, whereas most that contain [ʃ] are 
spelled with the character for [ʃi] with a subscripted character for [ya], 
[ye], [yu] or [yo] (according to the identity of the vowel). Moreover, they 
lie at the opposite end of the grid from the vowel characters, making it less 
obvious how to derive the character for the relevant vowel from that which 
represents the CV. 

In this regard, we had actually asked children to explain how they had been 
able to arrive at a correct response. Of the fourth graders, seven were 
unable to describe their strategy at all, nine gave evidence of using the 
"character substitution strategy," and four subjects described a 
"phonological" strategy that more or less amounted to doubling the vowel of 
the first syllable in a word and then removing the initial consonant-vowel 
portion (i.e., making [ki-pi] into [ki-i-pi], and then deleting [ki] to yield 
[i-pi]). The children who reported the "phonological strategy" had achieved 
some of the best scores in their age group, and they tended to be equally 
accurate in their responses to items containing [k] and [ʃ]. As for the sixth 
graders, all of whom had been exposed to the alphabet, only four appeared to 
have employed the "character substitution strategy," and they achieved some of 
the lowest scores in their age group, especially for items that began with [ʃ]. 
Fifteen of the remaining children reported some version of the "phonological 
strategy," and only a single child reported a strategy of using Romaji. 
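
The two reported strategies can be contrasted in a minimal sketch. It assumes items are given as lists of romanized mora, and it models the "character substitution strategy" with a deliberately simplified lookup that covers only the [k] characters described above as lying beside the vowel characters; the item beginning with "sha" is a hypothetical example, and the function names are illustrative.

```python
# Simplified lookup for the "character substitution" strategy: only the
# k-column characters are modeled as having a vowel character beside them.
RIGHT_NEIGHBOR = {"ka": "a", "ki": "i", "ku": "u", "ke": "e", "ko": "o"}
VOWELS = set("aiueo")


def character_substitution(morae):
    """Swap the first character for its neighbor on the grid, if modeled."""
    first = morae[0]
    if first not in RIGHT_NEIGHBOR:
        return None  # no adjacent vowel character to fall back on
    return [RIGHT_NEIGHBOR[first]] + morae[1:]


def phonological(morae):
    """Double the first vowel, then drop the initial consonant-vowel portion."""
    first = morae[0]
    vowel = first[-1]
    assert vowel in VOWELS, "first mora is expected to end in a vowel"
    doubled = [first, vowel] + morae[1:]   # [ki-pi] -> [ki-i-pi]
    return doubled[1:]                     # delete [ki] -> [i-pi]


print(phonological(["ki", "pi"]))             # ['i', 'pi']
print(character_substitution(["ki", "pi"]))   # ['i', 'pi']
print(phonological(["sha", "pi"]))            # ['a', 'pi']
print(character_substitution(["sha", "pi"]))  # None
```

In this simplified model the phonological strategy succeeds on both kinds of item, whereas character substitution yields an answer only for items beginning with [k], in keeping with the pattern of scores described above.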

General Discussion 

The present study asked whether Japanese children's awareness of syllables 
and phonemes differs from that of American children, as a consequence of their 
having learned to read a syllabary instead of an alphabet. The results 
clearly showed that Japanese children's approach to phonological counting and 
deletion tests is influenced by their reading experience. Knowledge of the 
Kana syllabary tended to confound performance on tasks that attempted to 
assess ability to manipulate phonological units, whether the tasks involved 
counting or deleting phonemes or syllables, and whether the instructions were 
ambiguous or explicit as to whether orthographic or sound units were being 
counted. Younger children in particular tended to manipulate the characters 
that spell a word rather than the phonological units that the characters 
transcribe. This tendency has previously been observed among American 
children (Ehri & Wilce, 1980) and has been one form of evidence that knowledge 
of an alphabet is responsible for phoneme awareness. 

The results further reveal performance differences between first graders in 
Japan and America and illustrate that knowledge of a syllabary/logography as 
opposed to an alphabet can have a very specific effect on phonological 
awareness. Relative to first graders in Japan, first graders in America can 
more accurately count the number of phonemes in words and can more accurately 
remove the initial phonemes from nonsense words. Thus, the experience of 
learning to read an alphabet must facilitate children's awareness of phonemes 
at this age. The analogous finding that Japanese children can surpass 
American children in performance on tasks that call for syllable manipulation 
likewise reveals that experience with a syllabary can facilitate the awareness 
of syllables. However, children, in general, find syllable manipulation an 
easier task than phoneme manipulation, which suggests that the experience of 
learning to read a syllabary vs. an alphabet is not the sole determinant of 
phonological awareness. 

What might the other determinants be? First of all, the development of 
phonological awareness may be a multifaceted process that depends on the 
abstractness of the unit at issue. Syllables, as compared to phonemes, are 
isolable acoustic segments; they are more superficial, less encoded components 
of the speech signal. Thus it is reasonable that syllable awareness should be 
an easier, more natural achievement of such factors as cognitive maturation 
and primary language development, requiring less special cultivating 
experience than awareness of phonemes. The results of previous research favor 
this view (Liberman et al., 1974; Alegria et al., 1982; Read et al., 1984). 
While awareness of syllables may be a precursor of awareness of phonemes, it 
is not sufficient, given that some individuals can manipulate syllables but 
not phonemes. Previous research had suggested that the ability to manipulate 
phonemes depends on knowledge of an alphabet (Byrne & Ledez, 1986; Liberman et 
al., 1985; Morais et al., 1979; Read et al., 1984), but the present study 
suggests that other factors can also play a role. 

The findings of Experiments II and IV emphasize the role of factors other 
than knowledge of the alphabet in the development of phoneme awareness, by 
revealing that, whereas most Japanese first graders could manipulate syllables 
but not phonemes, the majority of Japanese children were able to manipulate 
both syllables and phonemes by the fourth grade, whether or not they had been 
instructed in the use of an alphabet. Thus, with increasing age and 
educational experience, Japanese children may become more and more capable of 
manipulating phonemes whether or not they are alphabet-literate. 

This finding stands in contrast to previous reports that adults who do not 
know how to read an alphabet are not aware of phonemes, and some explanation 
is required. We may disregard the possibility that the differences between 
Japanese children and the alphabet-illiterate adults are due to task 
differences rather than differences in phonological awareness, per se. A 
concern with this possibility prompted Experiments III and IV, which employed 
deletion tasks analogous to those used in previous studies of illiterate 
adults. The results obtained in these experiments are much the same as those 
obtained with the counting tasks employed in Experiments I and II. This 
accords with some other observations that the task-unique cognitive demands 
posed by different tests of phonological awareness do not appreciably confound 
conclusions about young children's phonological awareness and its role in 
reading acquisition (Stanovich et al., 1984a). 

Perhaps a more reasonable interpretation is to accept the differences 
between the present findings and those obtained with alphabet-illiterate 
adults as differences in phonological awareness. We might then explore the 
possibility that other types of secondary language activity are responsible 
for the superior phonological awareness of the older Japanese children. One 
clear likelihood is that awareness of both syllables and phonemes is promoted 
by the experience of learning Kana, owing to the fact that it is a 
phonological orthography. This accords with the fact that many of the adults 
who proved deficient in phoneme awareness were functional illiterates (i.e., 
the American and Portuguese adults). It would also accord with the 
correlations between Kana reading ability and both syllable and phoneme 
awareness, observed in Experiments I and III (although the correlation leaves 
causality ambiguous). It might seem inconsistent with certain findings (i.e., 
Experiment III and Mann, 1984) that syllable awareness fails to correlate with 
the ability to read the alphabet, but ceiling effects are a possible 
confounding factor. Other studies, however, have reported a correlation 
between syllable awareness and reading ability (see, for example, Mann & 
Liberman, 1984; Alegria et al., 1982). 

A more serious problem with the view that knowledge of a phonological 
orthography promotes all aspects of phonological awareness concerns the lack 
of phoneme awareness among adult readers of the Chinese orthography (Read et 
al., 1984). As noted by Gelb (1963), Chinese, the most logographic of all the 
writing systems, is not a pure logographic system because from the earliest 
times certain characters have represented not words but phonological units. 
Many Chinese characters, the "phonetic compounds," are composed of a radical 
and a phonetic, each of which otherwise represents a word of the language. As 
noted by Leong (in press), the "fanqui" principle has been employed since 600 
A.D. for decoding phonetic compounds, a strategy that calls for blending the 
first part (initial consonant) and the tone of the word represented by the 
phonetic with the final part (syllable rhyme) of the word represented by the 
radical. Thus a compound, e.g., composed of "t'u" and "I'iau," decodes as 
"t'iau." Several Chinese colleagues inform me that classical methods of 
education in the Chinese logography have explicitly called the reader's 
attention to the phonetic components. Moreover, although phonological changes 
have necessarily altered the relationship between phonetic compounds and the 
words they represent, one recent study reveals that the adult readers of 
Chinese make use of the phonetic insofar as they name low-frequency (but not 
high-frequency) characters that involve phonetic compounds faster than 
non-phonetic compound characters (Seidenberg, 1985). Likewise, adult readers 
of Chinese can use phonetic radicals productively (Fong, Horne, & Tzeng, 
1986), to give consistent pronunciations for nonsense logographs composed of 
radicals and phonetics that do not co-occur. Given these findings, it is 
somewhat puzzling that exposure to phonetic compounds did not promote 
phonological awareness among Read et al.'s subjects, if exposure to any 
phonological orthography facilitates phoneme awareness. 

Putting aside the role of reading experience, it is possible that phoneme 
awareness is facilitated by some other secondary language experience that is 
available to Japanese children but not to the adults studied in Portugal and 
China. For Japanese children, the appropriate experience might involve 
learning to analyze or manipulate the phonological structure of spoken words 
while playing word games like "Shiritori" or while learning about Haiku. That 
the experience facilitating phonological awareness need not be limited to 
reading is evident from previous findings about the utility of explicit 
training in phonemic analysis (see Treiman & Baron, 1983, for example). 
Exposure to nursery rhymes and other poetry, for example, could help to 
explain why many American children are aware of syllables before they learn to 
read. But it would have to be argued that experience with such secondary 
language activities facilitates the development of all aspects of phonological 
awareness in a very general way, else how are we to explain the fact that 
Japanese children became able to manipulate phonemes despite a lack of 
experience with games and versification devices that directly manipulate 
phoneme-sized units? Even if it is postulated that any secondary language 
experience that manipulates phonological structure can give rise to awareness 
of both syllables and phonemes, there remains a problem insofar as meter and 
rhyme are exploited by both Chinese and Portuguese verse, song lyrics, etc., 
and would probably have been available to the illiterate adults who 
nonetheless lacked phoneme awareness. A further problem arises from the fact 
that, in the present study, all of the children were familiar with the Kana 
syllabary and the same types of word games and versification devices, yet only 
a small minority of the first graders (10%) were able to count phonemes, 
whereas the majority of fourth graders could do so. 

A similar argument can be made against the view that Japanese children knew 
about phonemes because thev had seen signs, labels, etc. written in the 
Romaji alphabet. Any explanation that passive exposure to the Romaji alphabet 
is responsible for the phoneme awareness of Japanese children would have to 
account for the fact that all children are exposed to Romaji signs and logos, 
yet only those aged nine and older had profited from that exposure. It would 
also have to account for the fact that passive exposure to alphabetically 
written material failed to promote phoneme awareness among the Portuguese 
adults studied by Morais et al. (1979). 

One final explanation of the differences between the present results and 
those obtained with alphabet-illiterate adults remains. The ability to 
manipulate both syllable- and phoneme-sized units could be a natural 
concomitant of primary language development that is exploited by many 
secondary language activities such as reading, versification, and word games. 
But if this capacity is a natural concomitant of primary language, how can it 
be deficient in alphabet-illiterate adults? Perhaps the ability to manipulate 
phonemes tends to atrophy unless maintained by appropriate reading experience. 
It has often been speculated that children acquire their primary language with 
the aid of a language acquisition device that is not present in adults. That 
the capacity for manipulating phonemes could be part and parcel of a language 
acquisition device follows from a suggestion made by Mattingly (1984), in 
answer to the question of why readers might be able to gain access to the 
otherwise reflexive processes that support the processing of phonological 
structure in spoken language. He suggests that an ability to analyze the 
phonological structure of spoken words might serve to increase the language 
learner's stock of lexical entries, and this, together with some other 
evidence that children have a privileged ability to acquire new lexical 
entries (Carey, 1978), could lead to the speculation that children have a 
privileged ability to manipulate phonological structure that somehow 
facilitates their ability to engage in secondary language activities that 
involve manipulations of phonological units. The prevalence of this capacity 
in childhood could promote children's acquisition of phonological 
orthographies during their elementary school years, and by postulating that 
this capacity atrophies in the absence of appropriate orthographic knowledge, one might 
explain the lack of phoneme awareness observed among alphabet-illiterate 
adults. However, this view is not without its problems, one being the fact 
that Japanese children could not do well on either the counting or elision 
tasks until relatively late in their childhood. Here, the cognitive demands 
of tests that are used to measure phoneme awareness and the confounding role 
of orthographic knowledge cannot be disregarded. Ongoing research with a 
broader battery of tests and a broader range of ages may further elucidate the 
basis of phonological awareness in the interplay between cognitive skills, 
primary language skills, and experience with secondary language activities 
such as reading. 


AN INVESTIGATION OF SPEECH PERCEPTION ABILITIES IN CHILDREN WHO 
DIFFER IN READING SKILL 



Susan Brady,† Erika Poggie,†† and Michele Merlo†† 



Abstract. Considerable evidence indicates that children who are 
poor readers have a phonetic coding deficit on linguistic short-term 
memory tasks. A previous study (Brady, Shankweiler, & Mann, 1983) 
had explored whether the initial perception of items might be the 
locus of the memory problem, and had demonstrated inferior speech 
perception abilities for poor readers with degraded stimuli. In the 
present study, the goal was to look more closely at perception under 
clear listening conditions. Third-grade good and poor readers were 
tested on a word repetition task with monosyllabic, multisyllabic, 
and pseudoword stimuli. Poor readers were significantly less 
accurate on the more demanding multisyllabic and pseudoword stimuli, 
though no group differences were obtained on speed of responding. 
The lack of reaction time differences between good and poor readers 
was corroborated on a control task in which verbal response time to 
nonspeech stimuli was measured. The reduced accuracy with clearly 
presented stimuli confirms the presence of subtle deficiencies in 
speech perception for children with reading difficulty and 
strengthens the hypothesis that poor readers' memory deficits may 
stem from less efficient encoding processes. 

Evidence has been steadily mounting that the associates of early reading 
difficulty lie in the phonological domain. One of the central areas of 
research contributing to this evidence has involved studies of short-term 
memory (STM). Children with reading problems have repeatedly been observed to 
have deficient recall on STM tasks when compared with better reading peers. 
The role of phonological processes in this deficit has been implicated by 
several findings: First, the memory deficit for poor readers is observed only 
for stimuli that can be phonetically recoded such as letters, words, and 
pictures of nameable objects. When stimuli are presented for recall that are 
not easily given a phonetic code, good and poor readers perform comparably. 
This contrasting result has been obtained with tasks employing photographs of 
strangers, nonsense doodle drawings, symbols from an unfamiliar writing 
system, and with auditorily presented tones (Holmes & McKeever, 1979; Katz, 
Shankweiler, & Liberman, 1981; Liberman, Mann, Shankweiler, & Werfelman, 1982; 
Vellutino, Pruzek, Steger, & Meshoulam, 1973). Thus the limits in STM for 
children with reading difficulty are specific to tasks requiring phonetic 
coding. 

Second, when the STM tasks consist of linguistic material, manipulations 
of phonetic dimensions of the stimuli generally affect the performance of 
young good readers more than that of young poor readers (Liberman & 
Shankweiler, 1979; Shankweiler, Liberman, Mark, Fowler, & Fischer, 1979). 
With strings of phonetically distinct (nonrhyming) stimuli, good readers show 
the usual superior recall for verbal material. When the phonetic 
confusability is increased by presenting rhyming items, the performance of 
good readers is impaired much more than the recall of poor readers. It has 
been reasoned that this pattern, also observed in adults, stems from the 
skilled readers being better able to form a sufficient phonetic code for 
temporary storage of information. Stimuli that minimize the phonetic 
contrasts between items in STM, such as lists of rhyming words, thus tend to 
have a greater effect on the recall of the good readers. Therefore, 
differential sensitivity to phonetic similarity by reading groups has been 
seen as a consequence of differing levels of skill in the use of a phonetic 
code. 

Subsequent studies have indicated that poor readers employ a phonetic 
code, but do so less accurately than good readers. Examining the nature of 
errors on verbal STM tasks, both reading groups produce phonetically-based 
mistakes such as transpositions of phonetic elements. However, the incidence 
of these errors is more frequent for the children with reading difficulty 
(Brady, Mann, & Schmidt, 1985; Brady, Shankweiler, & Mann, 1983). Additional 
research indicates that poor readers are not worse on all components of 
language processing: when other linguistic variables in STM tasks are 
experimentally varied, such as syntactic and semantic parameters, reading 
groups are equally affected (Mann, Liberman, & Shankweiler, 1980). Thus the 
memory deficits of poor readers are uniquely associated with phonetic 
requirements in STM, not with other aspects of language processing. 

An important insight about the extent of the phonetic coding problem 
arises from the observation that reading groups differ in STM recall whether 
the lists are presented visually or auditorily (Brady et al., 1983; Brady et 
al., 1985; Mann et al., 1980; Shankweiler et al., 1979). This finding 
suggests that poor readers experience a general difficulty in the use of a 
phonetic code, rather than an impairment specific to the encoding of visual 
information. 

To summarize, poor readers demonstrate short-term memory deficits only 
for stimuli that are phonetically recodable. These children show reduced 
sensitivity to rhyme and greater frequency of phonetic errors of 
transposition, providing further support that the deficit is related to 
phonetic skills. Lastly, these results are independent of the modality of 
presentation, pointing to a general phonetic processing deficit in STM. 

The current evidence is consistent with the view that the short-term 
memory deficit of poor readers stems from deficiencies in the use of a 
phonetic code. In exploring the phonetic basis of the memory problem, we have 
been conducting experiments to determine whether the problem arises in 
perception with the encoding of stimuli. If so, poor readers can be expected 
to do less well on perception as well as on recall tasks than good readers. 
This finding was obtained in a previous study (Brady et al., 1983) in which 
third-grade poor readers performed less accurately than good readers on a 
speech perception task requiring identification of words presented in noise. 
In contrast, the reading groups did not differ in performance on a nonspeech 
control task with environmental sounds. 

At the present we are working with the hypothesis that the difficulties 
of poor readers in speech perception and verbal STM tasks arise from a common 
source: the creation and maintenance of phonetic representations. From this 
approach, the efficiency with which the input is encoded will have 
consequences both in perception and in memory. Rabbitt (1968) carried out 
experiments with adults that supported this hypothesis. When digits were 
degraded slightly by the addition of noise, memory was observed to suffer, 
even though identification of the digits in isolation was still accurate. 
Rabbitt proposed that limited processing capacity was the basis for the 
reduction in memory span. That is, as increased resources were required for 
identification of the digits in noise, relatively less processing capacity was 
available for retaining the items in memory. 

Similar explanations have been offered for the commonly observed 
developmental increases in STM (Chi, 1976; Dempster, 1981), and the individual 
differences in memory span for adults (Baddeley, Thompson, & Buchanan, 1975; 
Hoosain, 1982). Hulme, Thomson, Muir, and Lawrence (1984) report that 
although younger children recall less, the same linear function relates 
speaking rate to short-term memory for subjects ranging in age from four years 
old to adulthood. They suggest that speech rate can be seen as a measure of 
rehearsal speed, so that increases in speech rate, rather than in memory span 
per se, account for the observed gains in STM during development. Case, 
Kurland, and Goldberg (1982) likewise found that speed of word repetition 
correlated with memory span scores for children three to six years of age. 
These authors propose the slightly different explanation that basic operations 
in perception and memory become more efficient with experience, requiring less 
processing space, and that as a consequence more functional space exists for 
storage. In an interesting test of this, Case et al. equated six-year-olds 
and adults on speed of word repetition by manipulating word familiarity, and 
correspondingly found that the word spans for these two age groups were no 
longer different. 

Given that the efficiency of phonetic processes appears to be related to 
normal developmental increases in memory span, it is of added importance to 
evaluate whether the STM differences associated with reading ability might 
arise from the efficiency of phonetic encoding. Since poor readers in the 
Brady et al. (1983) study made more errors repeating the speech-in-noise, it 
appears their perceptual skills are less well developed than those of children 
who are good readers. Therefore, it might also be the case that under clear 
listening circumstances the poor reader's encoding, though adequate, is less 
efficient (i.e., may require more processing resources) and may limit 
performance on recall tasks. 

To test this line of reasoning, we wanted to investigate whether 
differences in efficiency in perception are present under clear listening 
conditions, since this is the way stimuli are presented for most STM 
experiments. In the Brady et al. (1983) study, no perceptual differences 
between reading groups had been observed on accuracy scores for a noise-free 
task with monosyllabic words. However, since both good and poor readers had 
been at ceiling performance levels, it may not have been a sufficiently 
sensitive procedure to assess group differences in perceiving clear stimuli. 

If reading group differences in perceptual processing efficiency are 
present for clear listening, we speculated that this might take one of two 
forms: 1) poor readers might be slower at identifying or producing a phonetic 
utterance; 2) the quality of the phonetic representation might be less fully 
accurate for the poor readers. Under clear listening conditions with no time 
constraints and with relatively easy phonetic stimuli, poor readers could 
conceivably perform well with either or both of these processing limitations. 

With these questions in mind we examined third-grade good and poor 
readers on a speech repetition task. Our aim was to look more closely at 
whether reading group differences in perception are evident when stimuli are 
presented clearly. The responses were scored for accuracy, and reaction time 
(RT) measures were collected to assess processing speed. Three kinds of 
stimuli were presented: monosyllabic words, multisyllabic words, and 
pseudowords. In this way the phonological demands of the task were varied in 
case monosyllabic words (previously tested) were not sufficiently difficult to 
process to reveal potential group differences. Therefore the length of the 
stimuli was increased in the multisyllabic condition and the familiarity was 
decreased in the pseudoword condition. Both of these are known to increase 
processing demands in adults and we expected this also to be true for 
children. We hypothesized that the reduced accuracy of poor readers on 
speech-in-noise (Brady et al., 1983) had reflected on-going differences in 
perceptual skills that are only apparent on somewhat demanding tasks. 
Consequently we predicted that poor readers would be less accurate than good 
readers on the more difficult multisyllabic and pseudoword stimuli, but that 
both groups would do well on the monosyllabic items. 

The reaction time measure allows us to address the speed of processing 
issue raised in the developmental literature (e.g., Hulme et al., 1984). If 
group differences in RT were evident, we predicted that the good readers would 
be faster, indicating more rapid phonetic processing capabilities and possibly 
reflecting a developmental advantage. 

Anticipating that differences in reaction time might be present for good 
and poor readers, a control task was included so it would be possible to focus 
on what aspect of the repetition task was implicated. In this task, subjects 
were presented with nonspeech tones to which they were to respond rapidly with 
a specified word. If potential group RT differences in the word repetition 
task were related to articulation speed, reading group differences should be 
maintained on the control task. If instead they stemmed from identification 
processes for a phonetic input, the tone stimuli should not generate group RT 
differences. 

Methods 

Subjects. The subjects were third-grade children from a suburban school 
district in southern Rhode Island. The school reading coordinator targeted 
the children she thought would qualify as good or poor readers. These 
children were then administered the Word Attack and Word Recognition subtests 
of the Woodcock Reading Mastery Tests, Form A (Woodcock, 1973), and a test of 
receptive vocabulary, the Peabody Picture Vocabulary Test-Revised (PPVT-R; 
Dunn, 1981). In addition, the children were screened for hearing loss. Using 
a standard audiometer, each child's right and left ears were tested with tones 
at 500 Hz (25 dB), 1000 Hz (20 dB), 2000 Hz (20 dB), 4000 Hz (20 dB), and 
8000 Hz (20 dB). 

Children were selected as subjects if they met the following criteria: 
(1) To ensure appropriate classification as a good or poor reader, an 
individual was included only if the two scores on the Woodcock subtests were 
consistent (i.e., if both scores indicated a comparable level of reading 
ability). (2) In order to limit the range of vocabulary skills, participation 
was restricted to those with PPVT-R IQ scores between 90 and 125. (3) Because 
of the auditory requirements of the experimental tasks, only children who 
passed the hearing screening were eligible. In accord with routine 
procedures, an individual passed the screening if no more than a single 
frequency on each ear was undetected. (4) Given the evidence that the speech 
perception skills of children continue to progress during elementary school 
years (Finkenbinder, 1973; Goldman, Fristoe, & Woodcock, 1970; Schwartz & 
Goldman, 1974; Thompson, 1963), selection of subjects was limited to those 
whose ages fell within a one-year span (101-113 mos.). 
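
The four criteria above amount to a filter applied to each screened child. The following is a minimal sketch of that filter, not the authors' procedure: the record fields are hypothetical names, the Woodcock consistency check is reduced to a yes/no flag because the text gives no numeric rule, and treating the IQ and age limits as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Child:
    # All field names are hypothetical and exist only for this sketch.
    woodcock_consistent: bool  # both subtests indicate a comparable level
    ppvt_iq: int               # PPVT-R IQ score
    age_months: int
    missed_left: int           # screening frequencies not detected, left ear
    missed_right: int          # screening frequencies not detected, right ear


def passes_hearing(child: Child) -> bool:
    # Pass if no more than a single frequency is missed on each ear.
    return child.missed_left <= 1 and child.missed_right <= 1


def is_eligible(child: Child) -> bool:
    # Criteria (1) through (4); inclusive bounds are assumed.
    return (
        child.woodcock_consistent
        and 90 <= child.ppvt_iq <= 125
        and passes_hearing(child)
        and 101 <= child.age_months <= 113
    )
```

A child with an IQ of 108, aged 105 months, who missed one frequency in one ear would be retained under this reading; a child who missed two frequencies in the same ear would not.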

Thirty children (15 good readers and 15 poor readers) met the 
requirements for inclusion in the study. The characteristics of the two 
reading groups are summarized in Table 1. The Woodcock test scores were 
non-overlapping for the good and poor reading groups. The 15 children who 
were designated good readers were clearly beyond third grade reading mastery, 
with a mean reading grade level of 7.8. The 15 children who were labeled poor 
readers had an average lag of nine months below their expected level (x̄ = 3.1). 
Neither the ages, F(1,28) = .26, p = .61, nor the PPVT-R IQ scores, F(1,28) = 
1.23, p = .28, of the good and poor readers were significantly different. 
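
The F ratios reported throughout compare two groups of 15 children, which is where the 1 and 28 degrees of freedom come from. As a reminder of how such a value is formed, the sketch below computes a one-way analysis of variance F for two groups. The function name and the toy scores in the usage note are illustrative assumptions; the study's raw data are not given in the text.

```python
def one_way_f(group_a, group_b):
    """Return (F, df_between, df_within) for a two-group one-way ANOVA."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    grand_mean = (sum(group_a) + sum(group_b)) / (n_a + n_b)

    # Variation of the group means around the grand mean.
    ss_between = n_a * (mean_a - grand_mean) ** 2 + n_b * (mean_b - grand_mean) ** 2
    # Variation of individual scores around their own group mean.
    ss_within = (sum((x - mean_a) ** 2 for x in group_a)
                 + sum((x - mean_b) ** 2 for x in group_b))

    df_between = 1               # two groups minus one
    df_within = n_a + n_b - 2    # 28 when each group has 15 members
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within
```

For the toy groups [1, 2, 3] and [2, 3, 4] the function returns an F of 1.5 on 1 and 4 degrees of freedom. The function assumes some within-group variability; it would divide by zero if every score equalled its group mean.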



Table 1 

Means for Third Grade Children Grouped According to Reading Achievement 

Group    N     Age              IQ(a)     Reading Grade(b) 

Good     15    8 yr. 9 mo.      108.1     7.8 
Poor     15    8 yr. 10 mo.     104.5     3.1 

(a) Peabody Picture Vocabulary Test. 
(b) From the average of the reading grade scores obtained on the Word Attack 
and Word Recognition subtests of the Woodcock Reading Mastery Tests, Form A. 

Stimuli. Three sets of stimuli were used: (1) a set of 48 monosyllabic 
words; (2) a set of 24 monosyllabic pseudowords; and (3) a set of 24 
multisyllabic words. In addition, a 24-item control task was employed. 

Monosyllabic words. The monosyllabic word list (MONO) was the same as that 
used in a previous study (Brady et al., 1983). The words were chosen to 
control for syllable pattern, phonetic composition, and word frequency. There 
were 12 words for each of four syllabic patterns: CVC 
(consonant-vowel-consonant), CCVC, CCVCC, and CVCC. In addition, the words 
were chosen to provide a systematic phonetic set. Twenty words began with 
stop consonants (/b/, /d/, /g/, /p/, /t/, /k/), twenty words began with 
fricatives or affricates (/tʃ/, /s/, /f/, /ʃ/, /dʒ/, /v/), and four began 
with liquids (/r/, /l/). The same distribution of phonemes occurred in 
word-final position. 

For each syllable and phoneme pattern, half of the words included were 
reported to have a high frequency of occurrence in children's literature and 
half to have a low frequency (Carroll, Davies, & Richman, 1971). The words 
used are presented in Table 2. 



Table 2 
Monosyllabic Stimuli 

Words                                Pseudowords 
High Frequency    Low Frequency 

door              bale               dar 
team              din                tem 
road              lobe               rud 
knife             mash               nauf 
chief             chef               chife 
job               fig                jeeb 
grain             tram               grun 
breath            grouse             brath 
crowd             crag               crad 
sleep             slag               slape 
scale             spire              skell 
speech            skiff              spoach 
front             flint              frant 
plant             clamp              plint 
friend            frond              freend 
clouds            glades             cleeds 
blocks            drapes             blakes 
planes            prunes             pleens 
bank              kink               bink 
chance            finch              chounce 
list              rasp               liced 
month             nymph              manth 
child             vault              chauld 
ships             shacks             shaps 

Monosyllabic pseudowords. A set of 24 monosyllabic pseudowords (PSEUDO) 
was created by scrambling the medial vowels in the high frequency word set. 
In this way syllabic and phonetic patterns permissible in English phonology 
were maintained. The frequency of occurrence for these patterns was held 
constant for the word and pseudoword stimuli. Four adult speakers of English 
listened to the pseudoword items and judged whether each stimulus could be an 
acceptable word in English. Two vowel reassignments were made in accord with 
this feedback, resulting in the pseudoword stimuli listed in Table 2. 

Multisyllabic words. The multisyllabic stimuli (MULTI) were three- and 
four-syllable nouns, all pronounced with stress on the first syllable. Since 
it is more difficult to control strictly for phonetic parameters in 
multisyllabic words, the items were selected to represent an array of syllabic 
and phonetic constructions. For each syllable length an equal number of high 
frequency and low frequency words was included, again based on word counts 
from Carroll et al. (1971). The multisyllabic stimuli are listed in Table 3. 



Table 3 
Multisyllabic Stimuli 

High frequency    Low frequency 

basketball        badminton 
medicine          marmalade 
furniture         refugees 
neighborhood      saddlebag 
vitamins          vinegar 
satellite         silicone 
television        dormitory 
agriculture       anesthetic 
helicopter        honeysuckle 
supermarket       salamander 
military          malnutrition 
kindergarten      gladiators 


Stimulus preparation. The stimuli were recorded by a phonetically trained 
male speaker, with each produced as the final word of a meaningful sentence. 
The sentences were later digitized at 20,000 samples/sec and each stimulus was 
excised from the sentence, using the Haskins WENDY waveform editing system. 
The items were arranged into a fixed random sequence for each set of stimuli 
and were then recorded onto one channel of a magnetic tape with an 
inter-stimulus-interval (ISI) of 4 sec. At the same time, a series of pulses 
to be used for timing purposes was recorded on the second channel of the 
magnetic tape. A pulse was aligned temporally with the onset of each stimulus 
item. 

Control task. A brief 2000 Hz (100 ms) tone was recorded 24 times in two 
blocks of 12 trials on one channel of an audiotape. The ISI randomly varied 
with intervals ranging from 2.5 sec to 5 sec. To enable reaction time 
measures, a pulse was recorded on the second channel to co-occur with each 
tone. 

Apparatus. The stimuli were replayed on a reel-to-reel tape recorder. One 
channel, containing the stimuli, was output to the subject and to the 
experimenter via open-air soft-cushion headphones. The other channel, with 
the pulses, was connected to the onset trigger of a timer. As each word or 
pseudoword was produced on the tape recorder, the pulse triggered the counter 
on the timer. The subject would repeat the stimulus, as rapidly as possible, 
speaking into a pair of microphones centered in front of the subject. One of 
the microphones contained a voice key, which would terminate the counter. The 
resulting reaction time, displayed digitally, was written by the experimenter 
and was output from a printer. Via the second microphone, the subjects' 
responses were recorded on audiotape. Transcriptions of the responses were 
also made during the testing session. The response tapes were listened to 
later in the day in order to corroborate the transcription and to allow any 
necessary corrections. The same apparatus was used for the control task. 

Procedure. Each child was tested individually in a quiet room for three 
sessions. The first session included the Woodcock reading tasks and the 
Peabody Picture Vocabulary test. In the second session, occurring at least a 
week later, the children were given the hearing screening and the monosyllabic 
word reaction-time task. The third session, occurring approximately another 
week after the second, included the multisyllabic word RT task, the 
monosyllabic pseudoword RT task, and the control task. We elected to present 
the conditions in a single order that we felt would be easy for third graders 
to follow. 



For the speech stimuli tasks, the subjects were asked to say what they 
heard as quickly as possible. While speed was encouraged, the children were 
also instructed to say the words distinctly. Prior to the RT tasks, the 
subjects practiced repeating words said by the experimenter and then practiced 
repeating preliminary items on the tape. 

For the control task, subjects were instructed for the first twelve trials 
to say the word /cat/, as rapidly as possible, when a tone was heard. For the 
second block of twelve trials subjects were told to say /banana/ upon hearing 
a tone. 



Results and Discussion 



The responses were analyzed in terms of accuracy (number correct) and speed 
(reaction time). 



Accuracy scores. The responses were scored for phonetic accuracy. Each 
item was scored as correct or incorrect. If a subject stuttered or stammered 
during a response, this was not counted as an error. Any other misproduction, 
changing the phonetic description of the item, was noted as an incorrect 
response. The results are presented in Figure 1. Since the order of 
presentation of conditions was not counterbalanced, comparisons between 
performance of reading groups will be made within each set. 

On the monosyllabic words, which we had characterized as the least 
difficult set, the reading groups performed comparably, F(1,28) = .79, p = .38. 
More errors occurred on the low frequency words, F(1,28) = 39.79, p < .0001, but 
this was true for both reading groups, as can be seen in the lack of a 
frequency x group interaction, F(1,28) = .61. However, with the more 
demanding conditions, the poor readers produced significantly more errors. On 
the multisyllabic stimuli, group differences were obtained on the entire set, 
F(1,28) = 8, p = .009, and on both the high frequency, F(1,28) = 5.49, p = .03, and 
low frequency, F(1,28) = 6.45, p = .02, stimuli. Once again there was an overall 
effect of word frequency, F(1,28) = 7.78, p = .01, but this did not differ for 
good and poor readers, F(1,28) = .66, p = .42. An additional analysis was 
performed on the MULTI data, examining the effect of the length of the stimuli 
on the error rate. Both good and poor readers tended to produce more errors 
on the longer, four-syllable items, though this pattern was not significant, 
F(1,28) = 3.43, p = .08. While longer utterances may be more difficult to 
process, the particular phonetic sequence required appears to be a more 
salient factor. For example, in the four-syllable stimuli no errors were 
obtained on the item /salamander/ while many children mispronounced the 
cluster in /agriculture/. 

Figure 1. Accuracy performance of good and poor readers, plotted in mean 
percent correct. 

Since word frequency effects were obtained on both the MONO and MULTI 
conditions, one might predict an even higher error rate on the pseudowords, 
given that subjects obviously have no prior familiarity with these utterances. 
For poor readers this looks to be the case: they produced the most errors on 
the pseudoword stimuli. Good readers, on the other hand, had fewer errors on 
average on the pseudoword stimuli than on the low frequency monosyllabic real 
words. The good readers appear to have benefited from the previous trials, 
getting more experienced with the task and perhaps getting more finely tuned 
to the phonetic requirements of the task (e.g., adjusting to the particular 
dialect of the speaker). Thus the difference in performance between reading 
groups widened in the PSEUDO condition, again yielding significant results, 
F(1,28) = 9.98, p = .004. 

In sum, for accuracy measurements a noteworthy difference in performance 
was observed for the two reading groups. As we had predicted, the poor 
readers made significantly more errors on the multisyllabic and pseudoword 
conditions. The comparable effects of word frequency for both reading groups 
suggest that the perceptual problems of poor readers do not stem from 
possible differences in word knowledge. Our next step was to check whether IQ 
level might have been the underlying basis for these reading group results. 
Although the groups did not significantly differ on PPVT-R IQ scores, Crowder 
(1984) has pointed out that this may not adequately control for IQ factors. 
He argues that the size of the obtained group difference in IQ is not relevant 
in light of regression artifacts that may exist. To address this concern, one 
can test whether reading group differences in IQ might be responsible for the 
obtained results by recombining the subjects into high and low IQ groups. 
When this was done (high IQ: x̄ = 113.9; low IQ: x̄ = 98.5), the conditions that 
had revealed significant reading group effects were reanalyzed and no 
significant IQ group differences were evident (MULTI: F(1,28) = .91, p = .35; 
PSEUDO: F(1,28) = .21, p = .65). These results support the conclusion that the 
findings of speech perception differences for the good and poor readers arise 
from factors related to reading ability per se. 
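
The regrouping check described above can be sketched as follows. The text does not say how the high and low IQ groups were formed, so splitting the sample in half by IQ rank is an assumption made here for illustration, as are the function name and the (IQ, score) tuple layout.

```python
def split_by_iq(subjects):
    """Split (iq, score) pairs into high-IQ and low-IQ score lists.

    Subjects are ranked by IQ and divided at the midpoint of the ranking,
    ignoring their original reading-group membership.
    """
    ranked = sorted(subjects, key=lambda pair: pair[0], reverse=True)
    half = len(ranked) // 2
    high_scores = [score for _, score in ranked[:half]]
    low_scores = [score for _, score in ranked[half:]]
    return high_scores, low_scores
```

The two returned lists can then be compared with the same F test used for the reading groups; a nonsignificant result for the IQ split, alongside a significant one for the reading split, is the pattern the authors report.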

Analysis of reaction time data. The mean reaction times of correct 
responses for the three stimulus sets are shown in Table 4. Reaction times are 
excluded from trials in which the response was incorrect and/or the subject's 
reaction time was not within the limits of 200-2000 ms. 
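
The exclusion rule just stated can be written as a short filter. This is a sketch only: the function name and the (correct, rt_ms) trial layout are hypothetical, and treating the 200 and 2000 ms limits as inclusive is an assumption.

```python
def mean_correct_rt(trials, low_ms=200, high_ms=2000):
    """Mean RT over trials that were correct and within the RT limits.

    trials: iterable of (correct, rt_ms) pairs.
    Returns None when no trial survives the exclusions.
    """
    kept = [rt for correct, rt in trials
            if correct and low_ms <= rt <= high_ms]
    if not kept:
        return None
    return sum(kept) / len(kept)
```

Given five trials of which one is incorrect, one is faster than 200 ms, and one is slower than 2000 ms, only the remaining two contribute to the mean.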



Table 4 

Mean Reaction Time (ms) for Correct Trials 

         Monosyllabic Words              Multisyllabic Words             Pseudowords 
         High       Low                  High       Low 
         Frequency  Frequency  Total     Frequency  Frequency  Total 

Good     847.6      876.7      861.2     824.8      875.7      852.4     732.6 
Poor     818.3      857.4      838.8     760.7      788.5      772.1     686.7 



As described earlier, the conditions were presented in a single order (1. 
MONO; 2. MULTI; 3. PSEUDO), which we thought would be easy for third graders 
to perform. A general observation can be made that RT values get faster for 
successive blocks of trials (as is typical for adults), overcoming the 
processing requirements imposed by greater length or reduced word familiarity. 
However, these effects can be noted within conditions, as will be described 
below. 

The major finding for this scoring procedure is that there are no 
significant differences in RT between reading groups for any condition: MONO, 
F(1,28) = .21, p = .65; MULTI, F(1,28) = 2.80, p = .12; PSEUDO, F(1,28) = .82, 
p = .37. Further, although no group differences were significant, we were surprised 
that the poor readers, rather than being slower than good readers, were on 
average somewhat faster. This finding will be discussed below in relation to 
the error data. 

With children as subjects, concerns might be raised about the reliability 
of reaction time data. However, the RT values suggest the subjects were 
seriously engaged in the task, and the results indicate systematic effects of 
linguistic parameters. For example, expected word frequency effects (less 
frequent items taking longer to initiate) were observed on both the 
monosyllabic lists, F(1,28) = 27.63, p < .0001, and on the multisyllabic 
condition, F(1,28) = 18.54, p = .0002, with no interaction of word frequency with 
reading groups (MONO: F(1,28) = .59, p = .45; MULTI: F(1,28) = 1.59, p = .22). 

Had RT differences for good and poor readers been evident, we wanted to be 
able to focus on which aspect of the repetition task might have been 
responsible: identification of the input or articulation of the response. To 
do this, we administered the control task in which the speech identification 
process had been eliminated. Obviously, given the lack of reading group RT 
differences, the control results did not serve the original purpose. 
Nonetheless, the results do corroborate the lack of reading group RT 
differences in the word repetition tasks (monosyllabic control (/cat/): 
F(1,28)=1.0, p=.33; multisyllabic control (/banana/): F(1,28)=2.31, p=.14). 

In sum, there is no indication that the efficiency of phonetic processing 
that is represented in reaction time data differs for good and poor readers. 
We must consider, then, why the reading groups did not differ in reaction time 
performance but did contrast on accuracy scores. Traditionally, these two 
dependent measures of performance have been viewed as alternative ways of 
studying the same underlying processes (e.g., Eriksen & Eriksen, 1979; Lappin, 
1978; Smith & Spoehr, 1974). However, evidence has been reported recently 
suggesting that speed and accuracy measures do not always reflect the same 
aspects of information processing (Santee & Egeth, 1982). 

In the present study, speed/accuracy tradeoffs appear to be present for 
both good and poor readers. In Table 5, it can be seen that in some of the 
conditions significant negative correlations were obtained between RT and the 
incidence of errors. Our question is whether poor readers' tendency to have 
faster RTs might be contributing to the observed reading group error 
differences. For the monosyllabic condition this issue doesn't arise since 
the good and poor readers were not distinguished by error rate. In the 
multisyllabic task, the nonsignificant correlations between RT and errors 
indicate that other factors are the basis of the error performance. In the 
pseudoword condition, the two dependent measures were correlated, so an 
analysis of covariance was conducted on the error data using RT as the 
covariate. Significant reading group differences were still evident, 
F(1,27)=9.1, p=.006, again suggesting that while RT and accuracy scores in 
part arise from the same processes, other factors are uniquely contributing to 
the error scores. 

Table 5 

Correlations for Measures of Reaction Time and Error Rate 

                 Monosyllabic Words   Multisyllabic Words   Pseudowords 

Good Readers          -.20                 +.20                -.55* 
Poor Readers          -.65*                -.24                -.48* 

To reiterate, the results indicate that poor readers are less accurate in 
phonetic processing, but are not slower. It appears that it is necessary to 
have a somewhat demanding task in order to discern reading group differences 
in phonetic ability. On the more difficult tasks, omega squared was 
calculated to determine the proportion of variance accounted for by the 
accuracy of phonetic processing. The results are as follows: MULTI = .14; 
PSEUDO = .21. These effect sizes indicate that a fair amount of the 
performance differences between reading groups can be attributed to phonetic 
processes in perception. 
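
For readers who wish to relate such effect sizes to a reported F ratio, omega squared for a one-way between-subjects effect can be recovered from F, the effect degrees of freedom, and the total number of subjects. The sketch below uses the standard textbook identity; it is not taken from the authors' analysis. The function name and the F value in the example are hypothetical, and N = 30 is inferred from the error degrees of freedom reported for the two-group comparisons, F(1,28).

```python
def omega_squared(f_value, df_effect, n_total):
    """Omega squared for a one-way between-subjects effect, from its F ratio.

    Uses the identity
        omega^2 = df_effect * (F - 1) / (df_effect * (F - 1) + N),
    which follows from omega^2 = (SS_effect - df_effect * MS_error)
    / (SS_total + MS_error) when N = df_effect + df_error + 1.
    """
    numerator = df_effect * (f_value - 1.0)
    return numerator / (numerator + n_total)


# Hypothetical F ratio for a two-group comparison with 30 subjects.
example = omega_squared(f_value=9.0, df_effect=1, n_total=30)
```

An F ratio of 1 yields an omega squared of zero, as expected when the effect accounts for no variance beyond error.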

Conclusion 

In this study we looked more closely at good and poor readers' performance 
in speech perception with nondegraded stimuli in an attempt to explore the 
basis of poor readers' short-term memory deficits. On repetition tasks, RT 
and accuracy measures were taken for monosyllabic, multisyllabic, and 
pseudoword stimuli for third-grade good and poor readers. Although there was 
no indication of reaction time differences for the reading groups, the good 
readers were significantly more accurate than the poor readers for the more 
demanding multisyllabic and pseudoword stimuli. 

Our framework has been to consider whether differences in phonetic 
processing efficiency might be central to short-term memory function, which in 
turn plays a role in spoken and written language comprehension. Assumptions 
are being made in this approach that have been generally validated in research 
on cognitive processes. One is the assumption of a limited-capacity working 
memory system (Baddeley & Hitch, 1974). Second, within that system 
sub-processes are assumed to become more automatic with experience and to 
require less resource allocation (LaBerge & Samuels, 1974). Perfetti (1985) 
has formalized this approach in his "Verbal Efficiency Theory of Reading 
Ability" and provides a strong case for the role of the efficiency of lower 
level processes in language processing, and specifically in reading. 

Here we are examining one such lower level process, phonetic skills, to 
attempt to explicate the nature of the linguistic deficits occurring for poor 
readers on memory tasks. Given the consistent evidence of a relationship 
between speed of processing and memory span for adults (Baddeley et al., 
1975), as well as developmentally (Case et al., 1982; Hulme et al., 1984), it 
seemed plausible that the perception and memory deficits of poor readers might 
stem from reduced efficiency of perceptual processes and, consequently, from 
limited STM resources. Our results were mixed: the quality of responses was 
significantly less accurate for the more phonetically demanding stimuli, 
though somewhat surprisingly the poor readers were not found to be slower at 
initiating a phonetic response. In a subsequent study (Merlo & Brady, in 
preparation) this pattern has been replicated. Research by others generally 
conforms to this picture as well. For somewhat demanding speech tasks 
(speech-in-noise, Brady et al., 1983; multisyllabic words, Snowling, 1981; 
phonologically difficult phrases, Catts, 1984; tongue twisters, Merlo & Brady, 
in preparation), poor readers have repeatedly been observed to produce more 
errors. On the other hand, reaction time measures for tasks entailing 
creation of a phonetic representation (e.g., object naming, color naming, 
digit naming, word naming) have generally not revealed reading group 
differences in RT unless the stimulus involved orthographic information (Katz 
& Shankweiler, 1986; Perfetti, Finger, & Hogaboam, 1978; Stanovich, 1981). 
However, there are some indications that differences in naming speed may be 
present with younger children or more disabled readers (Blachman, 1981; 
Denckla & Rudel, 1976a, 1976b; Spring & Capps, 1974). In toto, these findings 
suggest that the important differences in perceptual operations between good 
and poor readers rest not with the rate of processing, but with the accuracy 
of formulating phonetic representations. 

To summarize, in the present study we have extended previous observations 
of inferior perception by poor readers with speech-in-noise to perceptual 
deficits with clearly presented stimuli. These results strengthen the 
hypothesis that the memory deficits commonly observed in poor readers for 
linguistic material may derive from the perceptual requirements of the task, 
that is, from less efficient encoding of the phonetic items. 

Footnotes 

¹However, it may well be the case that low level difficulties creating a 
phonetic representation in STM may have consequences on higher processes such 
as comprehension (cf. Mann, 1984; Perfetti, 1985). 

²This fits unreported observations in previous research we've conducted 
that both good and poor readers tend to produce errors at the beginning of a 
demanding speech task, but good readers show more rapid improvement. It would 
be interesting in future work to evaluate this aspect of phonological 
competence specifically for good and poor readers. 



PHONOLOGICAL AND MORPHOLOGICAL ANALYSIS BY SKILLED READERS OF SERBO-CROATIAN* 



Laurie B. Feldman† 



Abstract. Two distinctive properties of Serbo-Croatian, the major 
language of Yugoslavia, have been exploited as tools in the study of 
reading. First, most literate speakers of Serbo-Croatian are facile 
in two alphabets, Roman and Cyrillic. The two alphabet sets 
intersect and words composed exclusively from the subset of 
characters that occur in both alphabets can be assigned two 
phonological interpretations — one by treating the characters as 
Roman graphemes and one by treating the characters as Cyrillic 
graphemes. By exploiting the availability of two overlapping 
alphabets, the nature of phonological codes and how they figure in 
lexical access has been explored. Second, the inflectional and 
derivational morphology in Serbo-Croatian is complex, and extensive 
families of morphologically-related words exist. This complex 
morphology permits one to investigate how morphological structure is 
appreciated by the proficient language user. In the present report, 
results of a series of experiments that investigated phonological 
and morphological analysis in word recognition tasks by adult 
readers of Serbo-Croatian are summarized and discussed in terms of a 
characterization of skilled reading in Serbo-Croatian. To 
anticipate, the skilled reader of Serbo-Croatian appears to 
appreciate both phonological and morphological components of words. 

The Bialphabetic Environment 

Serbo-Croatian is written in two different alphabets, Roman and Cyrillic. 
The two alphabets transcribe one language and their graphemes map simply and 
directly onto the same set of phonemes. These two sets of graphemes are, with 
certain exceptions, mutually exclusive. Most of the Roman and Cyrillic 
letters occur only in their respective alphabets. These are referred to as 
unique letters. There are, however, a limited number of letters that are 
shared by the two alphabets. In some cases, the phonemic interpretation of a 
shared letter is the same whether it is read as Cyrillic or as Roman; these 
are referred to as common letters. In other cases, a shared letter has two 
phonemic interpretations, one by the Roman reading and one by the Cyrillic 
reading; these are referred to as ambiguous letters¹ (see Figure 1). Whatever 
their category, the individual letters of the two alphabets have phonemic 
interpretations (classically defined) that are virtually invariant over letter 
contexts. (This reflects the phonologically shallow nature of the 
Serbo-Croatian orthography.) Moreover, all the individual letters in a string 
of letters, be it a word or nonsense, are pronounced; there are no letters 
made silent by context (see Feldman & Turvey, 1983; Lukatela, Popadić, 
Ognjenović, & Turvey, 1980; Lukatela, Savić, Gligorijević, Ognjenović, & 
Turvey, 1978).² 

Table 1 

Types of Letter Strings and Their Lexical Status 

LETTER                 PHONEMIC 
STRING      READING    INTERPRETATION    MEANING 

AMBIGUOUS and COMMON letters (phonologically bivalent letter strings) 

BETAP       Roman      /betap/           meaningless 
            Cyrillic   /vetar/           wind 
POP         Roman      /pop/             priest 
            Cyrillic   /ror/             meaningless 
POTOP       Roman      /potop/           flood 
            Cyrillic   /rotor/           motor 
PAJOC       Roman      /pajots/          meaningless 
            Cyrillic   /rajos/           meaningless 

COMMON letters only 

MAMA        Roman      /mama/            mother 
            Cyrillic   /mama/            mother 
TAKA        Roman      /taka/            meaningless 
            Cyrillic   /taka/            meaningless 

Phonologically unequivocal controls 

VETAR       Roman      /vetar/           wind 
            Cyrillic   impossible 
ПОП         Roman      impossible 
            Cyrillic   /pop/             priest 
ROTOR       Roman      /rotor/           motor 
            Cyrillic   impossible 
ПОТОП       Roman      impossible 
            Cyrillic   /potop/           flood 
RAJOS       Roman      /rajos/           meaningless 
            Cyrillic   impossible 
ПАЈОЦ       Roman      impossible 
            Cyrillic   /pajots/          meaningless 



[Figure 1 diagram: Serbo-Croatian alphabet, uppercase. The Cyrillic and Roman 
letter sets overlap; the regions are labeled uniquely Cyrillic letters, common 
letters, ambiguous letters, and uniquely Roman letters.] 

Figure 1. The characters of the Roman and Cyrillic alphabets (reprinted from 
Feldman and Turvey, 1983, with permission from the American 
Psychological Association). 

Given the relation between the two Serbo-Croatian alphabets, it is 
possible to construct a variety of types of letter strings. A letter string 
that contains at least one uniquely Roman character in addition to shared 
characters would be read in only one way and it could be either a word or 
meaningless. A letter string composed entirely of common and ambiguous 
letters is bivalent. That is, it could be pronounced in one way if read as 
Roman and pronounced in a distinctly different way if read as Cyrillic; 
moreover, it could be a word when read in one alphabet and meaningless when 
read in the other or it could represent two different words, one in one 
alphabet and one in the other, or finally it could be meaningless in both 
alphabets (see Table 1). 

Consider the word that means WIND. As with any word in Serbo-Croatian, 
it can be written in either Roman characters or Cyrillic characters. In its 
Roman transcription (i.e., VETAR), the word includes unique and common 
characters and is phonologically unequivocal. By contrast, in its Cyrillic 
transcription (i.e., BETAP), the word includes only ambiguous and common 
characters and therefore is phonologically bivalent. By its Cyrillic reading 
it is a word; by its Roman reading it is meaningless. In the present series 
of experiments, two forms of the same word are compared where one is 
phonologically bivalent and the other is phonologically unequivocal. Notice 
that by comparing two printed forms of the same word, problems of equating 
familiarity, richness of meaning, length and number of syllables are 
eliminated.³ To reiterate, the letter strings exemplified by BETAP and VETAR 
are the same word and, therefore, identical in all respects but one, namely, 
the number of phonological interpretations. 
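
The classification of letter strings described above can be sketched in a few lines. This is an illustration only, not a procedure used in the experiments. The letter sets are those that can be read off the examples in the text (BETAP, POP, PAJOC, MAMA, TAKA, BEHA); the Roman value given for H is an assumption, since the text supplies only its Cyrillic value. Shared letter shapes are assumed to be typed as Latin capitals, and the function names are hypothetical.

```python
# Common letters: same phoneme under either alphabet reading.
COMMON = set("AEOJKMT")

# Ambiguous letters: (phoneme by Roman reading, phoneme by Cyrillic reading).
AMBIGUOUS = {
    "B": ("b", "v"),
    "C": ("ts", "s"),
    "H": ("h", "n"),  # Roman value assumed; the text gives only BEHA = VENA
    "P": ("p", "r"),
}

SHARED = COMMON | set(AMBIGUOUS)


def classify(letter_string):
    """Return 'unequivocal', 'bivalent', or 'common-only'."""
    letters = set(letter_string.upper())
    if not letters <= SHARED:
        return "unequivocal"   # at least one unique letter fixes the alphabet
    if letters & set(AMBIGUOUS):
        return "bivalent"      # shared letters only, at least one ambiguous
    return "common-only"       # shared letters only, one phonemic reading


def readings(letter_string):
    """Return (Roman reading, Cyrillic reading) for a string of shared letters."""
    roman, cyrillic = [], []
    for ch in letter_string.upper():
        if ch in AMBIGUOUS:
            r, c = AMBIGUOUS[ch]
        elif ch in COMMON:
            r = c = ch.lower()
        else:
            raise ValueError(f"{ch!r} is not a shared letter")
        roman.append(r)
        cyrillic.append(c)
    return "".join(roman), "".join(cyrillic)
```

Applied to the examples in the text, BETAP comes out as bivalent with the readings /betap/ and /vetar/, whereas VETAR is unequivocal because V and R occur only in the Roman alphabet.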

Phonological Analysis in Skilled Readers 

When bi-alphabetic adult readers of Serbo-Croatian performed a lexical 
decision task (i.e., Is this letter string a word by either a Roman or a 
Cyrillic reading?), single letter strings composed of ambiguous and common 
characters (i.e., those letter strings that could be assigned both a Roman and 
a Cyrillic alphabet reading) typically incurred longer latencies than the 
phonologically unequivocal alphabet transcription of the same word. This 
outcome has been reported both in a mixed alphabet context where the lexical 
interpretation of a letter string was sometimes in Roman and sometimes in 
Cyrillic (Feldman & Turvey, 1983; Lukatela et al., 1980) and a pure alphabet 
context where the lexical interpretation was always in Roman (Feldman, 1983; 
Lukatela et al., 1978). The effect of phonological ambiguity was significant 
both for bivalent words and pseudowords, but it was more robust for words. In 
characterizing the effect of ambiguity in lexical decision, several outcomes 
prove essential. First, the effect of phonological ambiguity did not vary as 
a function of word familiarity. For each word, decision latency to its 
phonologically unequivocal form was used as an index of familiarity and was 
correlated with the difference in decision latency between the bivalent and 
unequivocal forms of the word. In lexical decision, that correlation 
approached zero (Feldman & Turvey, 1983).⁴ Second, words composed entirely of 
common letters (with no ambiguous or unique letters) such as MAMA were 
accepted as words no more slowly than letter strings that included common and 
unique letters. Likewise, pseudowords composed entirely of common letters 
such as TAKA were rejected as words no more slowly than letter strings that 
included common and unique letters. Note that the distinction between common 
and ambiguous letters derives from their phonology: each type of letter 
occurs in both alphabets but only the latter have two phonemic 
interpretations. The foregoing discrepancy of outcomes suggests that it is 
phonological bivalence rather than a visually-based alphabetic bivalence that 
governs the slowing of decision latencies (see Lukatela et al., 1978, 1980, 
for a complete discussion). Third, lexical decision latencies to letter 
strings composed entirely of ambiguous and common letters were always slowed 
whether both alphabet readings yielded a positive response such as "POTOP" 
(Lukatela et al., 1980) or a negative response such as "PAJOC" (Feldman, 1981; 
Lukatela et al., 1978, 1980) or the Cyrillic reading and the Roman reading 
yielded opposite responses such as "BETAP" or "POP" (Feldman & Turvey, 1983; 
Lukatela et al., 1978, 1980). This outcome invalidates a decision stage 
account of the detriment due to bivalence that posits some type of 
post-lexical interference between conflicting lexical judgments. Moreover, 
insofar as lexical decision is alleged to be susceptible to decision-stage 
influences in a way that naming is not (Balota & Chumbley, 1984; Seidenberg, 
Waters, Sanders & Langer, 1984), it is noteworthy that the detriment due to 
bivalence is generally enhanced in naming relative to lexical decision. 
Finally, the difference in decision latency between the bivalent and 
unequivocal forms of a word increased as the number of ambiguous (but not 
common) characters increased (Feldman & Turvey, 1983). It was eliminated, 
however, by the presence of a single unique letter (Feldman, Kostić, Lukatela, 
& Turvey, 1983). These findings imply that a segmental phonology is assembled 
from an analysis of a letter string's component orthographic structure and 
that sometimes (multiple) phonological interpretations are generated. The 
foregoing results of lexical decision experiments with phonologically bivalent 
letter strings provide evidence that access to the lexicon in Serbo-Croatian 
necessarily involves an analysis that 1) is sensitive to phonology and 
component orthographic structure, 2) is not sensitive to the lexical status of 
the various alphabetic readings. These results have been interpreted as 
evidence for an assembled segmental phonology in Serbo-Croatian. 

In an attempt to understand conditions under which phonological codes and 
lexical knowledge do interact in Serbo-Croatian, we have begun to explore 
associative priming of phonologically bivalent words (Feldman, Lukatela, Katz, 
& Turvey, forthcoming). In this procedure, target words are sometimes 
presented in the context of another word that is associated with it and 
decision latencies to the target with and without its associate are compared. 
Phonologically bivalent words and the unequivocal alphabet transcription of 
those same words were presented as targets in a lexical decision task. Half 
of the bivalent targets were words by the Cyrillic reading and half were words 
by their Roman reading. On some proportion of trials, target words were 
presented in the context of another word that was associatively related to it 
and preceded it by 700 ms. Sometimes, the alphabet of the associate was 
congruent with the alphabet in which the target reading was a word. Sometimes 
the associate and the target reading were alphabetically incongruent. Results 
showed significant facilitation in the context of associates, evidence of 
lexical mediation. More interestingly, decision latencies for bivalent letter 
strings that are words by one of their alphabet readings were reduced less 
when those words are preceded by an associate printed in the other, 
incongruent alphabet than when the associate was printed in the same alphabet 
as the word reading of the target. This outcome suggests alphabetic 
congruency as a second source of facilitation. For example, bivalent BETAP, 
which means WIND when read as Cyrillic, was preceded by the word for STORM. 
Inspection of word means in Table 2 reveals that target decision latencies for 
BETAP type words were 66 ms faster when preceded by the Cyrillic form of the 
word for STORM than by the Roman form of the same word. By contrast, target 
decision latencies for the same words written in their phonologically 
unequivocal form were facilitated equally by the prior presentation of an 
associated word printed in either the congruent or incongruent alphabet. For 
example, WIND written in Roman, namely VETAR, is phonologically unequivocal 
and decision latencies were not significantly different when the word for 
STORM appeared in its Cyrillic or Roman form. Likewise for pseudowords, 
alphabet congruency had no effect (see Table 2). 

Table 2 

Lexical Decision (ms) to Bivalent Words and their Unequivocal 
Controls in the Context of Alphabetically Congruent and 
Alphabetically Incongruent Associates 

ALPHABET            BIVALENT     UNEQUIVOCAL 
OF ASSOCIATE:       (BETAP)      (VETAR) 

CONGRUENT           709          672 
INCONGRUENT         775          685 
(NO ASSOCIATE)      6^5          765 

From Feldman, Lukatela, Katz, and Turvey (in preparation) 



In summary, lexical decision latencies for phonologically bivalent letter 
strings are reduced significantly more when preceded by associates that are 
alphabetically congruent with the word reading of the letter string, than by 
associates that are not congruent. By contrast, decision latencies for 
phonologically unequivocal letter strings are not influenced by the alphabet 
of the associate. Associative and alphabetic sources of facilitation can be 
identified. Whereas facilitation by association occurs for all the words and 
is assumed to be lexical in origin, facilitation by alphabet congruency of 
associate and target was important only for bivalent letter strings. The 
special dependency of alphabetic congruency on ambiguity suggests that 
alphabetic priming and phonological ambiguity have a common origin. 
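
The contrast summarized in this paragraph can be checked directly against the cell means for associated targets in Table 2. The snippet below is a descriptive calculation on those four means only; it is not the authors' statistical analysis, and the variable and function names are hypothetical.

```python
# Cell means (ms) from Table 2 for targets preceded by an associate.
TABLE_2 = {
    "bivalent":    {"congruent": 709, "incongruent": 775},
    "unequivocal": {"congruent": 672, "incongruent": 685},
}


def congruency_effect(cells):
    """Cost (ms) of an alphabetically incongruent associate."""
    return cells["incongruent"] - cells["congruent"]


effects = {form: congruency_effect(cells) for form, cells in TABLE_2.items()}

# How much larger the congruency effect is for bivalent than for
# unequivocal forms of the same words.
interaction = effects["bivalent"] - effects["unequivocal"]
```

On these means the congruency effect is 66 ms for bivalent forms and 13 ms for unequivocal forms, a difference of 53 ms.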

In summary, studies of phonological ambiguity indicate that skilled 
readers of Serbo-Croatian analyze words phonologically. In judging letter 
strings composed exclusively of ambiguous and common letters for a lexical 
decision, adult readers appear to assign a phonological interpretation (or 
several) to each character (Feldman & Turvey, 1983). At the same time, the 
alphabet in which a prior occurring associate is printed appears to bias the 
generation or the evaluation of various phonological interpretations of a 
bivalent letter string. An analogous effect is absent in phonologically 
unequivocal words and in all pseudowords. 

Morphological Analysis in Skilled Readers 

The effect of phonological ambiguity has provided a means to evaluate the 
analytic skills of readers with respect to morphological components. As noted 
above, the Serbo-Croatian language, in a manner that is characteristic of 
Slavic languages generally, makes extensive use of inflectional and 
derivational morphology. A noun can appear in any of seven cases in the 
singular and in the plural where the inflectional affix varies according to 
its gender, number, and case. For example, the words STAN and KORA, which 
mean "apartment" and "crust," respectively, in nominative case can be 
inflected into six other cases in the singular and in the plural, and 
different inflectional affixes mark each case (with some redundancy of 
affixes). Similarly, derived forms for "little apartment" or "thin crust" can 
be generated by adding one of the diminutive affixes (viz., CIC, ICA, ENCE, AK) 
to the base word to produce STANCIC and KORICA, respectively. The prevalence 
of inflectional and derivational formations in Serbo-Croatian is evidence of 
its productiveness (see Table 3). 



Table 3 

Examples of Morphologically-related Words Formed with the Base 
Morpheme "PIS" Meaning "write" 

          DERIVATIONAL  BASE      DERIVATIONAL  INFLECTIONAL 
EXAMPLE   PREFIX        MORPHEME  SUFFIX        SUFFIX        MEANING* 

OPIS      O             PIS                                   description 
OPISI     O             PIS                     I             descriptions (nom. plural) 
PISEM                   PIS                     EM            I write (1p. sing) 
PISETE                  PIS                     ETE           you write (2p. plural) 
PISAC                   PIS       AC                          writer 
PISCIMA                 PIS       C             IMA           writers (dat. plural) 
PISMO                   PIS       MO                          letter 
POPIS     PO            PIS                                   inventory 
POTPIS    POT           PIS                                   signature 
SPISAK    S             PIS       AK                          list 

*all words are in nominative singular unless otherwise noted 



One way in which sensitivity to morphological constituents is construed 
is in terms of a morphological parser that operates prior to lexical access 
such that affixes are stripped from a multimorphemic word and the base 
morpheme serves as the primary unit for lexical search (see Caramazza, Miceli, 
Silveri, & Laudanna, 1985). Frequency of the base unit and the whole word as 
well as the difficulty of segmenting the appropriate base unit figure 
significantly in decision latency (Taft, 1979; Taft & Forster, 1975). In one 
experiment (Feldman et al., 1983) the effect of phonological ambiguity was 
exploited to assess whether the base morpheme or the whole word serves as the 
unit for lexical access of inflected words in Serbo-Croatian. Words were 
presented in nominative and dative case for a lexical decision. Words were 
selected so that the nominative case and the base morpheme (i.e., nominative 
minus inflectional affix for most singular nouns) were phonologically bivalent 
in the Cyrillic alphabet and phonologically unequivocal in Roman. For example, 
the nominative case of the word meaning VEIN is composed entirely of ambiguous 
and common letters when printed in Cyrillic (i.e., BEHA) and is therefore 
phonologically bivalent. In Roman, by contrast, it comprises unique and 
common letters (i.e., VENA) and is, therefore, phonologically unequivocal. 
Importantly, in the dative case, neither alphabet rendition is bivalent 
because the inflectional affixes for words of its class are the phonemes /u/ 
and /i/, both of which are represented by a unique letter in each alphabet, 
although the base morpheme of the Cyrillic form (i.e., BEH) is still bivalent. 

The major outcome of that experiment was a significant interaction of 
alphabet and case. The difference in latency between dative nouns presented 
in Cyrillic and Roman was -28 ms which was not significant, whereas the 
difference between nominative nouns was 304 ms, which was significant. In 
that dative nouns always included a unique letter, it appears that the effects 
of phonological bivalence do not occur if letter strings composed of ambiguous 
and common characters contain even one unique character. Importantly, in that 
experiment, the unique character always constituted an inflectional morpheme. 
Stated in terms of morphological units, the outcome of that experiment was 
that an inflectional affix composed of a unique character and appended to a 
bivalent base morpheme canceled the detriment due to ambiguity. Evidently, 
the reader could use the alphabet designation of the inflectional affix to 
assign a reading to the base morpheme. In conclusion, bivalence defined on 
the word but not on the base morpheme alone slowed performance on a lexical 
decision task. This outcome indicates that lexical access of inflected nouns 
is not restricted to information in the base morpheme unit. Rather, it 
encompasses the entire word. 

An alternative perspective on a reader's appreciation of morphology 
assumes that lexical entries are accessed from whole word units and that the 
principle of organization among lexical entries or the lexical representations 
themselves capture morphological structure. The final experiment (Feldman & 
Moskovljević, in press) exploits the complex derivational morphology of 
Serbo-Croatian to provide further evidence that whereas the morphological 
structure of words is accessible to the skilled reader, lexical entries are 
not accessed from base morphemes. The experiment incorporated a comparison of 
three types of nouns all in nominative case: (1) base forms (e.g., STAN, 
KORA); (2) the diminutive form of those same nouns, which as described above, 
is formed (productively) by adding one of the suffixes CIC, ICA, ENCE, AK to 
the base form (e.g., STANCIC, KORICA), where choice of suffix is constrained 
by gender of the noun, and (3) an unrelated monomorphemic word whose 
construction inappropriately suggests that it contains the same base form and 
a diminutive affix (e.g., STANICA, KORAK). The latter are referred to as 
pseudodiminutive nouns. The examples mean "station" and "step," respectively. 
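
The logic of the pseudodiminutive comparison can be made concrete with a deliberately naive affix stripper. The sketch below is not a model proposed by the authors; it only illustrates why a parser that strips diminutive suffixes before lexical search would treat STANICA and KORAK like true diminutives. The suffix list is the one given in the text; the function names, the two-word lexicon, and the rule that a base may lose its final letter before the suffix (KORA, KORICA) are assumptions made for the illustration.

```python
# Diminutive suffixes listed in the text, longest first.
DIMINUTIVE_SUFFIXES = ("ENCE", "CIC", "ICA", "AK")


def strip_diminutive(word):
    """Return (stem, suffix) if `word` ends in a diminutive suffix, else None."""
    for suffix in DIMINUTIVE_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return None


def candidate_base(word, lexicon):
    """Base form a prelexical affix-stripping parser would search for.

    A base matches if it equals the stem or extends it by one final letter
    (an assumption covering KORA -> KOR + ICA). Returns None if the word
    carries no diminutive suffix or no base in `lexicon` matches.
    """
    parsed = strip_diminutive(word)
    if parsed is None:
        return None
    stem, _suffix = parsed
    for base in lexicon:
        if base.startswith(stem) and len(base) - len(stem) <= 1:
            return base
    return None


LEXICON = ("STAN", "KORA")  # base forms used as examples in the text
```

Under this scheme the true diminutives STANCIC and KORICA and the pseudodiminutives STANICA and KORAK all lead to the same base forms, which is the situation the experiment is designed to probe.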

The experimental design was a variation on the primed lexical decision 
task borrowed from Stanners and his colleagues (Stanners, Neiser, Hernon, & 
Hall, 1979) and known as repetition priming. In the present adaptation of the 
task, base forms appeared as target words preceded 7 to 13 items earlier in 
the list by a prime that was either the identical word again in its base form, 
its diminutive or a pseudodiminutive form. Decision latency to the target as 
a function of which type of prime preceded it was examined. In addition, 
decision latencies to the first presentation of the word in its base, 
diminutive, and pseudodiminutive forms were compared. Results are summarized 
in Table 4. 



Table 4 

Lexical Decision (ms) to Target Words Preceded by Identity, 
Diminutive, or Pseudodiminutive Primes 

PRIME               TARGET              TYPE OF PRIME 

STAN       610      STAN       563      IDENTITY 
STANCIC    754      STAN       585      DIMINUTIVE 
STANICA    718      STAN       609      PSEUDODIMINUTIVE 

From Feldman and Moskovljević (in press) 



Decision latencies on primes were fastest for base forms, followed by 
pseudodiminutives and lastly, diminutives. The pattern corresponded with that 
predicted by frequency and provided no evidence that monomorphemic 
pseudodiminutive forms were slowed by an inappropriate parsing of morphemic 
structure. In addition, latencies for base and diminutive forms correlated 
significantly and neither correlated with pseudodiminutive forms. An 
examination of target latencies provided further evidence that 
pseudodiminutive words are not associated with an inappropriate base morpheme 
(and affix), whereas true morphological relationships are appreciated. 
Decision latencies to target words that were preceded by pseudodiminutive 
words were as slow as target words presented for the first time. In contrast, 
both base word and diminutive primes significantly reduced target decision 
latencies. In summary, results in the repetition priming variation of lexical 
decision showed significant facilitation for morphological relatives and no 
facilitation for unrelated pseudodiminutive words. In light of the claim that 
semantic relatedness of prime to target does not facilitate target decision 
latencies at lags as long as those introduced in the present task (Dannenbring 
& Briand, 1982; Henderson, Wallis, & Knight, 1984), the foregoing results are 
interpreted as morphological in nature. In conclusion, the present experiment 
showed that the skilled reader of Serbo-Croatian is sensitive to morphological 
structure as evidenced by the results in repetition priming, but offered no 
evidence that morphological analysis entails decomposition to a base morpheme 
prior to lexical access. 

In summary, an examination of results from lexical decision and naming 
tasks that take advantage of the bi-alphabetic condition in Serbo-Croatian 
provides evidence that skilled reading in Serbo-Croatian proceeds with 
reference to phonology. Specifically: 1) Skilled readers are slowed when a 
letter string is phonologically bivalent relative to when it is phonologically 
unequivocal. 2) The alphabet congruency of a prior-occurring associate can
speed decision latencies for phonologically bivalent (but not unequivocal)
words. Moreover, it appears that phonological bivalence is defined on the 
entire word, not in the base morpheme alone, which suggests that 3) Skilled 
readers do not attempt lexical access from an isolated base morpheme. 
Concurrently, they consider its affix. Failure to find evidence that base 
morphemes are the units for lexical access should not be construed as a claim 
against morphological analysis by the reader, however. The results from 
repetition priming indicate that prior presentation of a morphological 
relative but not of a visually similar word facilitates decision latency to a 
target. The foregoing results support the claim that the skilled reader of 
Serbo-Croatian analyzes words both phonologically and morphologically.

Footnotes

¹The introduction of two alphabets into Yugoslavia reflects the influence
of the Orthodox Church in the Eastern regions and the Catholic Church in the
Western regions. The Cyrillic script is probably an adaptation of the Greek
uncial alphabet of the 9th century A.D. and the Roman script is a variation of
the Latin alphabet, which was also derived from the Greek, probably via
Etruscan (Diringer, 1948). In both cases, the scripts had to be adjusted to
represent sounds not present in the Greek language and several mechanisms have
been identified: 1) Combining two or more characters to represent a single
phoneme such as DŽ and, arguably, LJ and NJ; 2) Adding a diacritical mark to
an existing letter to form a new letter such as Č, Ć, Š. The creation of new
letters by inclusion of a diacritic is particularly prevalent in the
adaptation of Roman script to languages whose repertoire of phonemes differs
greatly from the Latin. Palatal-alveolar fricatives and affricates are
represented in this fashion in many Slavic languages, including Serbo-Croatian
(Wellisch, 1978); 3) Taking an existing symbol that was not used in the new
language to represent a phoneme not present (or represented by multiple
symbols) in the old language. For example, Roman C became /ts/ and Roman K
remained /k/; 4) Borrowing characters from other scripts. Insofar as
particular adaptations were made independently in each alphabet and the shapes
of some letters (e.g., D, S, R) were modified slightly in the transition to
Latin (Diringer, 1948), the intersection of the two alphabet sets represents a
complex of factors.

²One consequence of the consistent mapping of grapheme to phoneme is that
many dialectal variations are represented in writing such that spelling as
well as pronunciation can vary from region to region. For example, the word
that means MILK is MLEKO in the dialects near Belgrade and is MLIKO in
dialects along the Dalmatian Coast. It is important to note that the
orthography fully specifies segmental phonology but that accent
(rising/falling; long/short) is not represented. Although vowel accent may
differentiate between two semantic interpretations of a written letter string,
this distinction is often ignored, especially in the dialects of the
larger cities (Magner & Matejka, 1971).

³By law, all elementary school students must demonstrate competence to
read and write in both alphabets. With the exception of liturgical text,
which is relatively uncommon, the choice of alphabet is not systematic across
genres of printed material. Therefore, it can be assumed that the Roman and
Cyrillic forms of a word are equally familiar to the skilled reader.

⁴In naming, however, more familiar words showed smaller effects of
phonological ambiguity (Feldman, 1981). Analogous to claims made from studies
with English materials (Seidenberg, 1985; Stanovich & Bauer, 1978), those
words that are recognized more slowly and are presumably less familiar are
more susceptible to phonological effects in a naming task than are
more familiar words.

Abstract. Two experiments were performed to examine the nature of
handshape similarity for the 26 elements of the American manual
alphabet. Forty deaf college students, half native (first language)
signers of American Sign Language and half nonnative signers,
participated in the study. In Experiment 1, subjects were asked to
base their judgments on visual characteristics of the shapes. In 
Experiment 2, they were asked to base their judgments on aspects of 
manual shape production. Hierarchical clustering and 
multidimensional scaling analyses showed the two sets of judgments 
to be quite similar. No clear differences were found between native 
and nonnative signers in either experiment. These data provide a 
basis for the future manipulation and detection of manual coding in 
the processing of verbal stimuli. 

In recent years, there has been considerable interest in the cognitive 
processes of deaf persons (see for example, Conrad, 1979; Furth, 1973; 
Neville, Kutas, & Schmidt, 1982), frequently focusing on the use of 
speech-based and manual codes in the processing of verbal materials (Bellugi, 
Klima, & Siple, 1975; Dodd & Hermelin, 1977; Hanson, 1982a; Quinn, 1981; 
Treiman & Hirsh-Pasek, 1983). Experimentation in this area often requires an 
understanding of stimulus similarity so that confusability and selective
interference can be systematically varied (see, e.g., Hanson, Liberman, &
Shankweiler, 1984; Locke & Locke, 1971). Although several studies have
characterized the phonetic similarity of common stimuli (e.g., Miller &
Nicely, 1955, for English consonants; Conrad, 1964, for letter names), an
adequate characterization of comparable manual stimuli has not been done. 

Two different forms of manual language are used by deaf individuals in
the course of conversation: fingerspelling and sign. Fingerspelling, like
spoken languages, uses temporal sequencing of constituent elements to convey
morphemes. The handshapes shown in Figure 1 constitute these elements. Each
is a one-handed representation of a letter in the American manual alphabet.
Words are spelled out by producing these handshapes sequentially in the space
to the side of the signer's face.¹ Although many of the shapes are similar
to those used in sign, fingerspelling does not use the other parameters
essential to sign language in making linguistic distinctions (Klima & Bellugi,
1979; Stokoe, Casterline, & Croneberg, 1965). Fingerspelling is used to
convey specific names or words for which no sign equivalent exists and can be
used to convey entire conversations (the Rochester method).

*Perception & Psychophysics, 1985, 38, 311-319.








Figure 1. Drawings used as stimuli in Experiment 1. 

The present paper focuses on the visual and production similarity of the 
26 elements of the American manual alphabet. Deaf college students, both 
native (first language) and nonnative signers of American Sign Language (ASL), 
served as informants. Previous studies had examined only subsets of these 
handshapes (Lane, Boyes-Braem, & Bellugi, 1976; Locke, 1970; Stungis, 1981), 
or had used hearing subjects with limited prior fingerspelling experience
(Weyer, 1973).² Experiment 1 examined the similarity of handshapes as visual
objects. Experiment 2 examined production similarity. 

Experiment 1

Method 

Stimuli. Simple line drawings of the 26 handshapes of the American
manual alphabet were the stimuli in this experiment. These handshapes and the 
letters they represent are shown in Figure 1 (note that the letters did not 
appear with the experimental stimuli). Each handshape was individually 
rendered on a card measuring approximately 2 1/2 by 3 ^/^, in. 

Procedure. Subjects were tested individually. At the beginning of an
experimental session, the 26 cards were laid out in front of the subject. The
arrangement was random, with the constraint that each handshape appear in the
proper orientation (i.e., the top of each handshape was always to be on the
top).

The subjects were instructed to sort the handshapes into piles on the
basis of visual similarity. The following written instructions were presented
to the subjects: "The 26 letters of the manual alphabet are laid out in front
of you. Begin by looking at each handshape and paying attention to how it
looks. Then put the handshapes into piles, so that handshapes that look
similar are in the same pile. You can have as many piles as you wish and you
can have any number of handshapes in each pile. You can change your mind as
often as you like until your arrangement seems best." The experimenter, a
deaf native signer of ASL, discussed the instructions with each subject in
sign to make sure that the task was clearly understood.

Subjects. The subjects for the experiment were 20 prelingually deaf
students from Gallaudet College. Half were native signers of ASL (having
learned ASL as a first language from their deaf parents) and half were not.
The nonnative signers reported a minimum of 13 years' signing experience. On
the average, they had learned to sign at the age of 6.2 years; the mean length
of signing experience for these subjects was 18.7 years. All subjects were
paid for their participation in this 15-min experiment.

Results and Discussion. Table 1 summarizes the number of subjects who
sorted a given handshape pair into the same pile. A score of 20 thus
represents the maximum possible interitem similarity. To discover any
structure inherent in this matrix (to discover, that is, how the handshapes
might be naturally grouped), we applied Johnson's (1967) hierarchical
clustering procedure (after first converting the raw frequency counts to a
dissimilarity matrix). Separate analyses using the maximum and minimum
methods for determining intercluster distance were conducted. Johnson
observed that the two methods yield very similar results (at least with the
sort of data considered in his report). This was true of the present
experiment, in which the maximum and minimum results shared 9 of the 10
letter-pair clusters. Johnson also noted that when the results of the maximum
and minimum methods diverge, those obtained with the maximum method appear to
be more interpretable. In Figure 2, we show the clusterings produced by the
maximum method. In this figure, similarity decreases as one goes from the top
to the bottom and clusters are indicated by adjacent x's. Thus M and N can be
seen to be more strongly clustered than A and T, which are more strongly
clustered than C and O, and so forth.
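The analysis path described above (pile sorts, then pairwise counts, then a dissimilarity matrix, then clustering by the maximum and minimum methods) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the five-handshape sorts are invented rather than taken from Table 1, and dissimilarity is assumed to be the number of subjects minus the co-occurrence count, which the text does not specify.

```python
from itertools import combinations

def cooccurrence_counts(sorts):
    """For each item pair, count the subjects who put both items in one pile."""
    counts = {}
    for piles in sorts:                       # one subject's sort is a list of piles
        for pile in piles:
            for pair in combinations(sorted(pile), 2):
                counts[pair] = counts.get(pair, 0) + 1
    return counts

def to_dissimilarity(counts, items, n_subjects):
    """Assumed conversion: subjects who did NOT place the pair together."""
    return {pair: n_subjects - counts.get(pair, 0)
            for pair in combinations(sorted(items), 2)}

def johnson_clustering(dissim, items, method="max"):
    """Merge the two closest clusters until one remains.

    method="max" scores two clusters by their most dissimilar members
    (complete linkage); method="min" by their least dissimilar members
    (single linkage).
    """
    link = max if method == "max" else min

    def distance(c1, c2):
        return link(dissim[tuple(sorted((a, b)))] for a in c1 for b in c2)

    clusters = [frozenset([item]) for item in sorted(items)]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: distance(*p))
        merges.append((sorted(c1 | c2), distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges

# Hypothetical sorts from four subjects over five handshapes (not the study's data).
sorts = [
    [{"M", "N"}, {"A", "T"}, {"C"}],
    [{"M", "N", "A", "T"}, {"C"}],
    [{"M", "N"}, {"A"}, {"T", "C"}],
    [{"M", "N"}, {"A", "T"}, {"C"}],
]
items = ["A", "C", "M", "N", "T"]
counts = cooccurrence_counts(sorts)
dissim = to_dissimilarity(counts, items, n_subjects=len(sorts))
max_merges = johnson_clustering(dissim, items, method="max")
min_merges = johnson_clustering(dissim, items, method="min")
```

Each entry of `max_merges` or `min_merges` records the members of a newly formed cluster and the dissimilarity level at which it formed, so reading the list from first to last corresponds to reading a clustering diagram from the most to the least similar groupings.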

It can be seen that only one cluster combines an appreciable number of 
handshapes; A, T, M, N, E, and S form a group characterized by compactness. 
Most of the remaining clusters — handshape pairs — appear to be grouped on the 
basis of a single essential similarity: For the pairs K-P, G-Q, I-J, and D-Z, 
the letters are formed by identical shapes, which differ only in orientation
(K-P, G-Q) or movement (I-J, D-Z); the pair C-O represents two degrees of what
might be called hand closure, other aspects (orientation, finger
configuration) being the same; and the pairs V-W and B-U differ only in the
number of fingers extending upward.

Table 1



Number of Subjects Sorting Handshape Pairs Into Same Pile 
on the Basis of Visual Similarity in Experiment 1

Figure 2. Hierarchical clustering of the handshapes in Experiment 1.




Richards and Hanson: Visual and Production Similarity of Handshapes 



To supplement this cluster-based description, the data were examined for 
dimensionality and spatially interpretable structure using a nonmetric 
multidimensional scaling (MDS) procedure (as developed by Kruskal, 1964, and
Shepard, 1962). Although a stress analysis suggested no clearly appropriate
dimensionality, ease of interpretation leads us to prefer the unrotated
two-dimensional solution depicted in Figure 3.
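For readers unfamiliar with the stress criterion mentioned here, the following is the usual statement of Kruskal's stress (often called stress formula 1). It is given as background only; that this exact variant was the one computed in the present analysis is an assumption, since the text does not say.

```latex
S = \sqrt{\frac{\sum_{i<j} \left( d_{ij} - \hat{d}_{ij} \right)^{2}}{\sum_{i<j} d_{ij}^{2}}}
```

Here $d_{ij}$ is the distance between handshapes $i$ and $j$ in the fitted configuration, and $\hat{d}_{ij}$ is the corresponding disparity, that is, the value obtained by monotone regression of the distances on the observed dissimilarities. Lower stress indicates a closer fit, and a stress analysis compares $S$ across solutions of increasing dimensionality.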



Figure 3. The two-dimensional MDS solution for the handshapes in Experiment 1.

Several aspects of this solution warrant comment. We see the horizontal 
dimension as representing hand compactness with open or extended handshapes on 
the left and closed handshapes on the right. The vertical dimension appears
best characterized as orientation of the hand's major axis with vertically
oriented handshapes near the top and horizontally oriented ones near the
bottom. The distribution of the handshapes within this space is somewhat
fan-like; closed handshapes cluster tightly in the orientation dimension (much
more so than can be represented in this figure), whereas open or extended ones
are widely dispersed. Another way to say this is that to the extent that
closed handshapes have an orientation, it is common to them all.

Finally, an INDSCAL analysis of the individual dissimilarity matrices
found no evidence for different organizations as a function of whether ASL was
the subject's native language. This may be seen in Figure 4, which plots each
subject's weights on the two dimensions of compactness and orientation. There 
is no evidence that the native signers (filled circles) differ from the 
nonnative signers (open circles) in their dimensional weightings. 







Figure 4. Individual subject's weightings on the two dimensions of
orientation and compactness in Experiment 1. Filled circles
represent native signers of ASL, open circles represent non-native
signers.

Experiment 2 

In the second experiment, similarity judgments were based on the
essentially kinesthetic aspects of manual handshape production. To help
ensure this, uppercase letters were used as stimuli (forcing subjects to
generate, either overtly or covertly, the handshapes being compared at any
point during the sorting task). Instructions emphasized that production
similarity was to be assessed. In other respects, the second experiment was
identical to the first.

Method

Stimuli. Stimuli were the uppercase representations of the 26 letters of
the alphabet. Each character was printed, in 30-point lettering, on a 3 x 5
in. index card.



Procedure. The procedure was similar to that of Experiment 1. For this
experiment, the subjects were instructed to sort the cards into piles on the
basis of the similarity of the handshapes. Written instructions were given to
the subjects, and a deaf experimenter (the same person as in Experiment 1)
reviewed the instructions with subjects to ensure they were understood. The
written instructions were as follows: "The 26 letters of the alphabet are
laid out in front of you. Begin by thinking about the handshapes for each
letter. Then put the letters into piles, so that letters that have handshapes
that are similar to produce are in the same pile. You can have as many piles
as you wish and you can have any number of letters in each pile. You can
change your mind as often as you like until your arrangement seems best.
REMEMBER TO THINK ABOUT EACH HANDSHAPE AND GROUP THE LETTERS ACCORDING TO
SIMILARITY OF THE HANDSHAPES."



Subjects. Twenty deaf students from Gallaudet College participated. Half
were native signers of ASL, the other half were not. The data of one of the
nonnative signers were eliminated from analysis due to an apparent failure to
follow the instructions (the sorting of this subject was based on the visual
similarity of the printed letters rather than on the production similarity of
the handshapes, as evidenced by clusters such as W-M, X-K, F-E, and A-H). The
remaining nine nonnative signers reported a minimum of 13 years' signing
experience. On the average, they had learned to sign at the age of 5.3 years;
the mean length of signing experience for these subjects was 17.1 years. All
subjects were paid for their participation in this 15-min experiment.



Results and Discussion. Table 2 summarizes the number of subjects who
sorted a given handshape pair into the same pile (19 being the maximum
possible similarity score). As in the first experiment, these data were
subjected to a hierarchical clustering analysis; the result is shown in Figure
5.

In slight contrast to the first experiment, these data appear to possess 
less global structure. In particular, the compact handshapes (A, T, M, N, E,
and S) exhibit no tendency to cluster as a single group. Rather, two smaller
clusters emerge, each being describable in production-relevant terms: The E, 
A, and S handshapes share the position of the four fingers, differing only in 
thumb placement relative to the finger group; the M, N, and T handshapes 
differ only in the number of fingers extended over the thumb, with M having 
three, N having two, and T having only one. With this exception the remaining 
clusters—pairs once again—are primarily grouped as before. This similarity 
between the results of Experiments 1 and 2 is further supported by a 
moderately large correlation between the matrices shown in Tables 1 and 2 (r =
.66, df = 323, p < .01).
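The degrees of freedom reported here follow from the number of distinct handshape pairs: 26 handshapes yield 26 x 25 / 2 = 325 pairs, and a correlation over 325 paired values has 325 - 2 = 323 degrees of freedom. A minimal sketch of such a matrix correlation is given below. The helper names and the small 4 x 4 matrices are hypothetical and are not the data of Tables 1 and 2; the sketch also assumes that each pair is counted once (upper triangle only).

```python
from itertools import combinations
from math import sqrt

def upper_triangle(matrix, labels):
    """Flatten the cells above the diagonal into a list, one value per item pair."""
    index = {label: i for i, label in enumerate(labels)}
    return [matrix[index[a]][index[b]] for a, b in combinations(labels, 2)]

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Degrees of freedom for the full 26-handshape comparison.
n_letters = 26
n_pairs = n_letters * (n_letters - 1) // 2   # distinct handshape pairs
df = n_pairs - 2                             # df for a correlation over those pairs

# Hypothetical same-pile counts for four handshapes (illustration only).
labels = ["A", "E", "S", "T"]
visual = [
    [0, 10, 12, 15],
    [10, 0, 14, 6],
    [12, 14, 0, 8],
    [15, 6, 8, 0],
]
production = [
    [0, 13, 16, 7],
    [13, 0, 17, 3],
    [16, 17, 0, 5],
    [7, 3, 5, 0],
]
r = pearson_r(upper_triangle(visual, labels), upper_triangle(production, labels))
```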

Table 2

Number of Subjects Sorting Handshape Pairs Into Same Pile
on the Basis of Production Similarity in Experiment 2

Figure 5. Hierarchical clustering of the handshapes in Experiment 2.

The dimensionality and spatial structure of these data were next analyzed
using an MDS procedure. The two-dimensional solution shown in Figure 6 
exhibits many of the same characteristics as before. The horizontal dimension 
represents hand compactness with open or extended handshapes on the left and 
closed handshapes on the right. The vertical dimension represents orientation 
of the hand's major axis with vertically oriented handshapes near the top and 
horizontally oriented ones near the bottom. And although the distribution of 
the handshapes within this space is somewhat more uniform than in Experiment 
1, we view the two solutions as essentially similar. 

Further evidence of this similarity derives from a comparison with the 
solution obtained by Weyer (1973) for visual handshape confusability. 
Although Weyer did not choose to interpret the two dimensions of his solution, 
they correspond to the dimensions of compactness and orientation found here. 
Moreover, the distribution of handshapes within the space is very similar to 
the distribution shown in Figure 6 (with the only significant exception being 
a left-right reflection of the compactness dimension). We conclude from this 
that production similarity and visual similarity are structurally similar. 

Finally, we found no evidence for different organizations as a function of 
whether ASL was the subject's native language. An INDSCAL analysis suggested, 
as before, that the groups were similarly dispersed within the space of 
dimensional weights. This is shown in Figure 7. 

General Discussion 

In the two experiments reported here, visual and production similarity for 
the handshapes of the American manual alphabet were determined to be 
essentially the same. For both sets of judgments, the dimensions of hand 
compactness and orientation were found to describe the data. And for both 
sets of judgments, similar numbers and arrangements of handshape clusters 
emerged. We conclude from this that judged handshape similarity is relatively 
unaffected by the modality to which the judge attends. The present results 
also suggest that at least within the range of relatively skilled signers, 
perceived handshape similarity does not vary as a function of degree of 
experience with fingerspelling; we found no differences between native and
nonnative deaf signers.

The present data are quite consistent with earlier studies of perceptual 
confusability. They are in accord with results reported for the subset of 
manual alphabet handshapes included in the ASL studies of Lane et al. (1976) 
and Stungis (1981). In these two studies, the major differentiating feature 
was whether fingers were extended (open) or not extended (compact). 

The present data are also in accord with the results obtained by Weyer 
(1973) for the entire manual alphabet. Weyer investigated the confusions that 
emerged during tachistoscopic recognition of computer-generated handshapes.
His clustering analysis indicated that the largest cluster was composed of the
N, S, T, and A handshapes, with an adjacent cluster composed of the E, M, and
O handshapes. These handshapes, characterized by Weyer as involving fists and
folding fingers, are the same as those found by our analyses to be "compact." 
Some of the smaller clusters found by Weyer were also apparent in the visual 
similarity data of the present Experiment 1, e.g., B-U, V-W, and I-J. Some
differences did arise, however, in specific clusterings of handshape pairs.
We found, for example, that our deaf subjects judged, as visually similar,
pairs that had similar shapes but differed in orientation (K-P, G-Q) or
movement (D-Z). These groupings were not obtained by Weyer. Such differences
might be attributed to procedural variation (Weyer used tachistoscopic
recognition; we used a sorting task), or to differences in the handshape
stimuli used, or to subject differences. To the extent that the differences
are reliable, we suspect that subjects' differing familiarity with the manual
alphabet underlies them. Twelve of the 15 subjects in Weyer's study were
hearing, and the level of fingerspelling expertise was given for none of the
subjects. It is possible that his hearing subjects were totally unfamiliar
with fingerspelling prior to the experiment. If so, they would have tended to
rely on visual features, whereas our more experienced subjects might have
allowed their knowledge of handshape production (e.g., K and P are the same
handshape, just oriented differently) to influence their judgments.

Figure 6. The two-dimensional MDS solution for the handshapes in Experiment 2.

Figure 7. Individual subject's weightings on the two dimensions of
orientation and compactness in Experiment 2. Filled circles
represent native signers of ASL, open circles represent non-native
signers.

The present data are consistent, moreover, with patterns of interletter
confusion obtained in tasks requiring the short-term retention of printed
letter strings. Two studies, one by Conrad and Rush (1965) and another by
Wallace and Corballis (1973), examined short-term retention by deaf subjects
with manual language experience (and published the raw confusion matrices
needed here). Of these two, only the one by Wallace and Corballis included,
in the stimulus set, a high proportion of letters found by our techniques to
be manually similar.³ From this fact alone we might expect that the
confusion data of Conrad and Rush would be less influenced by manual
similarity than the data of Wallace and Corballis. The correlations
summarized in Table 3 are in line with this expectation (note that the results
in this table were derived by correlating the interletter confusion matrices,
collapsed across conditions within each of the two studies, with the subset of
our manual similarity matrices containing the letter subset used in each of
the two studies). We find an interpretable pattern of correlations within the
conditions of the Wallace and Corballis study as well. In Table 4, separate
correlations are shown for stimulus strings of length 4 and 5 for subjects
with manual training and for those with oral training. The higher
correlations for the longer stimulus strings may well correspond to a greater
reliance on language codes in short-term memory. The higher correlations for
the manual subject group may well reflect a greater tendency to associate the
printed letter strings with the corresponding handshapes (a tendency made all
the more likely by their history of instruction in the Rochester method, a
technique in which all words are fingerspelled). These two trends are even
more apparent in the right half of the table. Here we show the correlations
between confusion and similarity matrices from which the letter pair G-Q has
been excluded. Since Wallace and Corballis noted that the lowercase forms of
their stimulus letters G and Q were highly similar visually (differing only in
a right- versus left-hooking descender), and since these two letters are also
quite similar manually (same handshape in different orientation), this
exclusion affords a clearer picture of the relationship due to manual
similarity alone.
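The effect of dropping a single confounded pair before correlating can be illustrated with a short sketch. Everything below is hypothetical: the five-letter subset, the similarity and confusion values, and the function names are invented for illustration and are not taken from Wallace and Corballis (1973) or from the present similarity matrices. The point is only that one pair that is high on both measures (here G-Q) can raise the correlation, so that excluding it isolates the contribution of the remaining pairs.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical manual similarity counts for a five-letter subset.
similarity = {
    ("A", "G"): 1, ("A", "N"): 9, ("A", "Q"): 0, ("A", "T"): 12, ("G", "N"): 0,
    ("G", "Q"): 15, ("G", "T"): 1, ("N", "Q"): 1, ("N", "T"): 11, ("Q", "T"): 0,
}
# Hypothetical short-term memory confusion counts for the same pairs.
confusion = {
    ("A", "G"): 3, ("A", "N"): 4, ("A", "Q"): 2, ("A", "T"): 5, ("G", "N"): 3,
    ("G", "Q"): 12, ("G", "T"): 2, ("N", "Q"): 4, ("N", "T"): 3, ("Q", "T"): 3,
}

def matrix_correlation(sim, conf, exclude=()):
    """Correlate two pair-keyed matrices, skipping any pairs listed in exclude."""
    pairs = [pair for pair in sorted(sim) if pair not in exclude]
    return pearson_r([sim[p] for p in pairs], [conf[p] for p in pairs])

r_all = matrix_correlation(similarity, confusion)
r_without_gq = matrix_correlation(similarity, confusion, exclude={("G", "Q")})
```

With these invented values the correlation is lower once G-Q is removed, which mirrors the direction of the change between the two halves of Table 4 without reproducing its numbers.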

The present results are not consistent with the production similarity data
obtained by Locke (1970). Locke found the following pairs of handshapes to be
rated as the most similar kinesthetically: K-P, B-Y, F-B, R-P, T-V, and X-K.
Of these pairs, our subjects judged only the pair K-P to be highly similar.
The pair F-B was judged to be only moderately similar. It is likely that the
limited set of nine letters used by Locke, combined with a forced choice
methodology, imposed a set of similarity relationships unrepresentative of the
larger set of handshapes. It is also possible that subjects misinterpreted
his instructions. Consider, for example, that the letter pair T-V was rated
highly similar by Locke's subjects (in contrast to Weyer's study and the
present one, which are the only other studies to include both the T and V
handshapes). This letter combination is frequently produced by deaf
individuals (in referring to television) and is quite easy to produce as a
rapid sequence. If such an "ease of co-production" criterion was adopted by
Locke's subjects, there would be little reason to expect our results to be
similar.

Table 3

Correlations of STM Confusion Matrices with Manual Similarity Matrices

                                    Manual Similarity Matrix

STM Confusion Matrix             Visual     Production     Combined

Conrad & Rush (1965)              -.17         .06           -.08
Wallace & Corballis (1973)                     .51**         .50**

(** significant at .01 level or better)

Table 4

Correlations of STM Confusion Matrices (from Wallace & Corballis, 1973)
with Combined Manual Similarity Matrix

                                          Letter Set

STM Confusion Matrix             Including G-Q     Excluding G-Q

Manually Trained Subjects
  List Length 4                      .38**              .07
  List Length 5                      .48**              .45**

Orally Trained Subjects
  List Length 4                      .29*              -.06
  List Length 5                      .37*               .20*

(* significant at .05 level or better, ** at .01 level or better)

In summary, our characterization of handshape similarity appears reasonably
stable across both judgment modality and degree of experience. It is
consistent with previous work in perceptual confusability, and is related in
straightforward ways to patterns of interletter confusion in short-term
memory. Future experiments can draw on these results either to manipulate
systematically or to detect the use of manual codes in the processing of
verbal stimuli.

Footnotes

¹In skilled fingerspelling, letters of words are neither produced nor
recognized as isolated letters. Rather, one finds evidence for coarticulatory
effects in production (Reich, 1974) and facilitation of recognition in
familiar clusters (Hanson, 1982b; Zakia & Haber, 1971).

²The subjects in Weyer's experiment were 12 hearing subjects and 3 deaf
subjects. Since the data of the deaf and hearing subjects were not presented
separately, we do not know to what extent the overall characterization is
representative of the deaf users of the language system.

³The study by Conrad and Rush (1965) used only 9 different letters: B, F,
K, P, R, T, V, X, and Y. The study by Wallace and Corballis (1973) used only
10 letters: A, B, D, E, G, H, N, Q, R, and T. If we look at the visual and
production similarity judgments obtained in the present study, it can be seen
that the letters used by Conrad and Rush are relatively low in rated
similarity (with the exception of K and P, which are moderately similar). The
letters used by Wallace and Corballis have several pairs that were found by
our techniques to be manually similar (namely, A-E, A-N, A-T, N-T, and G-Q).

SHORT-TERM MEMORY FOR PRINTED ENGLISH WORDS BY CONGENITALLY DEAF SIGNERS:
EVIDENCE OF SIGN-BASED CODING RECONSIDERED



Vicki L. Hanson and Edward H. Lichtenstein†



Abstract. Shand (1982) found that deaf signers' recall of lists of
printed English words was poorer when the American Sign Language
translations of those words were structurally similar than when they
were structurally unrelated. He presented these results as evidence
of sign-based coding of printed words. This conclusion is
challenged by the present finding that a group of hearing subjects,
who were tested on Shand's stimuli and were unfamiliar with sign
language, showed similar performance decrements on the lists of
words having structurally similar signs. Alternative accounts of
these findings for both hearing and deaf subjects are discussed.

The nature of short-term memory coding of printed words by deaf
individuals is of considerable importance, both for theoretical and practical
reasons. Theoretically, investigations in this area can provide insight into
the role of speech coding in short-term ordered recall. Does the use of a
speech code by hearing individuals derive from their experience with speech as
a primary means of communication (Shand, 1982; Shand & Klima, 1981)? Or does
speech coding provide an effective way of storing ordered information due to
the highly sequential character of spoken language (Baddeley, 1979; Crowder,
1978; Healy, 1975)? If due to the primary use of a spoken language to
communicate, then a language code rooted in another modality (e.g., a code
based on visual/gestural sign language of deaf individuals) should be as
effective a code for short-term ordered recall as a speech code. If, however,
speech is an effective code for ordered recall due to its sequential
properties, then deaf signers (who, in general, have received some speech
instruction) may not be inclined to recode into signs, owing to the fact that
signs involve simultaneous structuring of linguistic elements to a much
greater extent than does speech (Klima & Bellugi, 1979).

In addition to these concerns, practical educational issues call for 
research on short-term memory coding by deaf individuals. In particular, 
research in this area addresses issues of coding in reading, such as whether 
deaf children can use speech coding in reading and whether sign coding can be 
used as an effective alternative to speech coding for deaf children in the 
acquisition of reading (see, for example, Conrad, 1979; Hanson, Liberman, &
Shankweiler, 1984).

The presentation of American Sign Language (ASL) signs to deaf subjects
for ordered recall has consistently provided evidence that information is
coded in terms of the formational parameters (cheremes) of the signs; thus,
presenting lists of formationally similar signs has led to decrements in
serial recall (Hanson, 1982; Poizner, Bellugi, & Tweney, 1981; Shand, 1980,
1982), and presenting lists of unrelated signs has led to intrusion errors in
which the incorrect items are formationally similar to the original signs
(Bellugi, Klima, & Siple, 1975; Krakow & Hanson, 1985). In contrast, less
consistent outcomes have been reported in experiments in which printed words
have been presented to deaf subjects. Shand (1980, 1982) reported decrements 
in recall of printed words whose corresponding signs were formationally 
similar to other signs within the same list. However, the use of similar 
procedures by other researchers failed to obtain this same outcome, even when 
testing native signers of ASL (Hanson, 1982; Lichtenstein, in press). 
Furthermore, no evidence has been obtained of sign-based intrusions in deaf 
signers' recall of printed words (Krakow & Hanson, 1985). The failure to
obtain evidence of sign-based coding with printed words cannot be attributed 
to insufficient power in the experimental design: Two of the studies that 
failed to find evidence of sign-based coding when printed words were presented 
found evidence of sign-based coding when signs were presented (Hanson, 1982; 
Krakow & Hanson, 1985). 

The present paper focuses on the work of Shand (1980, 1982) in an attempt 
to resolve the discrepancy between his results and those reported by other 
investigators. A resolution of this issue is highly desirable given the 
importance of such findings for theories of short-term memory and their
pedagogical implications. 

Shand's procedure (following Baddeley, 1966) involved the use of
experimental sets of words chosen to be similar along a given dimension. For
this purpose, Shand had a phonetically similar set of words (SHOE, THROUGH,
NEW, SHOW, NO, SEE, THREE, SEW) and a cheremically (sign) similar set of words
(CANDY, ONION, APPLE, JAPANESE, JEALOUS, CHINESE, SOUR, BORED). The signs
corresponding to each of the words in the cheremically similar set are
similarly formed. Accuracy on lists of words taken from the experimental sets
is compared, in this procedure, with accuracy on lists of words taken from a
control set. As controls, Shand used four words from the phonetically similar
set and four words from the cheremically similar set. The resulting set of 
words (SHOE, THROUGH, APPLE, JAPANESE, NO, SEE, SOUR, BORED) allowed for a 
comparison of accuracy of specific words when they were presented in the 
experimental set vs. when they were presented in the control set. Thus, each 
of the words in the control set was matched with a word in one of the
experimental sets. 
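
The composition of the control set can be checked mechanically. The following is a minimal sketch, assuming the three word sets are exactly as listed above (with SEE taken as the sixth word of the phonetically similar set, as it appears in the control set); the function name matched_words is illustrative and is not part of Shand's procedure.

```python
# Word sets as listed in the text (Shand, 1980, 1982).
PHONETIC = {"SHOE", "THROUGH", "NEW", "SHOW", "NO", "SEE", "THREE", "SEW"}
CHEREMIC = {"CANDY", "ONION", "APPLE", "JAPANESE",
            "JEALOUS", "CHINESE", "SOUR", "BORED"}
CONTROL = {"SHOE", "THROUGH", "APPLE", "JAPANESE",
           "NO", "SEE", "SOUR", "BORED"}


def matched_words(control, experimental):
    """Return the words shared by the control set and one experimental set."""
    return sorted(control & experimental)


matched_phonetic = matched_words(CONTROL, PHONETIC)
matched_cheremic = matched_words(CONTROL, CHEREMIC)

# Each experimental set should contribute four words to the control set,
# and together they should account for every control word.
print(matched_phonetic)
print(matched_cheremic)
```

These two four-word subsets are the "matched" items that can be compared across the control and experimental lists.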

The subjects in Shand's study were eight congenitally, profoundly deaf
signers of ASL. Three were native signers of ASL; the other five had a 
minimum of seven years of signing experience. On each trial, the subjects saw 
five of the words from a word set and were asked for immediate written ordered 
recall of these words. Shand found, both with list scoring (percent lists 
recalled perfectly) and with item scoring (percent words correctly recalled in 
the correct position), that the deaf subjects had poorer recall of words with 
presentations from the cheremically similar set than from the control set. 
This same finding was obtained when analyzing only those four words that were
common to the cheremic and control sets; recall of these words was less 
accurate when they were presented in lists from the cheremically similar set 
than when presented in lists from the control set. In contrast, there was no 
significant difference in accuracy between performance on the phonetically 
similar lists and the control lists for either list or item scoring. 
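
The two scoring procedures can be illustrated with a short sketch. It assumes that each recalled list is aligned with the presented list position by position (with None standing for a blank response); the function names and the two example trials are hypothetical and are not data from either study.

```python
def list_score(presented, recalled):
    """Percent of lists recalled perfectly (every word in its correct position)."""
    perfect = sum(1 for p, r in zip(presented, recalled) if list(p) == list(r))
    return 100.0 * perfect / len(presented)


def item_score(presented, recalled):
    """Percent of words recalled in the correct serial position."""
    total = sum(len(p) for p in presented)
    correct = sum(
        1
        for p, r in zip(presented, recalled)
        for shown, written in zip(p, r)
        if shown == written
    )
    return 100.0 * correct / total


# Hypothetical example: two five-word trials, the second with two words transposed.
presented = [
    ["SHOE", "NO", "SEE", "NEW", "SEW"],
    ["APPLE", "SOUR", "BORED", "CANDY", "ONION"],
]
recalled = [
    ["SHOE", "NO", "SEE", "NEW", "SEW"],
    ["APPLE", "BORED", "SOUR", "CANDY", "ONION"],
]

print(list_score(presented, recalled))  # one of two lists is perfect
print(item_score(presented, recalled))  # eight of ten items are in position
```

List scoring penalizes any error on a trial, whereas item scoring credits each correctly positioned word, which is why the two measures can differ in sensitivity.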

Shand concluded that these results provided evidence for sign-based 
coding of printed words. However, confounds within his stimulus sets lead us
to question this conclusion. First, semantic associations occurred among the
words in the cheremically similar set (e.g., CHINESE-JAPANESE,
CANDY-APPLE-ONION), of the sort that have been found to produce decrements in
short-term memory for hearing subjects (Baddeley, 1966). In addition,
Lichtenstein (in press) noted that the cheremically similar words had more
letters (21% more), more syllables (36% more), and less visual distinctiveness
(in terms of the range of number of letters per word) than the control words. 

Shand reported no data from hearing subjects on his task. However, he 
stated that pilot studies with hearing subjects revealed recall decrements for 
the phonetically similar lists relative to the control or cherological lists. 
He did not state whether or not the hearing pilot subjects demonstrated recall
decrements for the cheremically similar lists relative to the control lists. 
In view of the potential stimulus confounds noted above and the implications 
of his results, such a control group is vital. 

Reported here are the results of hearing subjects tested in the
short-term memory task of Shand. We used the same stimuli and procedures,
making only one modification in the experimental design: Due to the fact that
hearing subjects, in most short-term memory tasks, are able to recall more
items than deaf subjects, we increased the number of words presented on a
trial from five to six in an attempt to keep our error percentages roughly
comparable to those obtained with the deaf subjects tested by Shand.

Method 

Stimuli. The stimuli were the three word sets used by Shand (1980,
1982), as given above. 

Procedure. On each trial, subjects were presented with six words from 
one of the three sets. The words were serially presented, at a 1 s 
presentation rate, on a computer controlled CRT display. They were presented 
in uppercase letters. There were 16 trials of words from each set, with each
word occurring an equal number of times in each serial position. Trials were 
blocked, such that subjects saw all 16 lists from one set of words before 
proceeding to a different set. The subjects were tested individually, and the 
order of set presentation was varied between subjects. During the testing on 
lists from a given set, the eight words of that set were displayed, on index
cards, for the subjects. Each set was typed, in different orders, on two
index cards; some of the subjects saw the first ordering of words, while other
subjects saw the second.
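
One way to satisfy the constraint that each word occur equally often in each serial position is a cyclic rotation of the word set. The sketch below is an assumption about how such lists could be built, not a description of how the lists in this experiment were actually generated; the function name build_trials and the seed parameter are illustrative.

```python
import random
from collections import Counter


def build_trials(words, n_trials=16, list_length=6, seed=0):
    """Build trials in which every word fills every serial position equally often.

    Trial t places word (t + p) mod n at position p, so across n_trials
    (a multiple of n) each position sees each word n_trials / n times.
    """
    n = len(words)
    if n_trials % n != 0 or list_length > n:
        raise ValueError("n_trials must be a multiple of len(words), "
                         "and list_length must not exceed len(words)")
    rng = random.Random(seed)
    order = list(words)
    rng.shuffle(order)   # randomize which word starts the rotation
    trials = [[order[(t + p) % n] for p in range(list_length)]
              for t in range(n_trials)]
    rng.shuffle(trials)  # randomize trial order within the block
    return trials


words = ["SHOE", "THROUGH", "NEW", "SHOW", "NO", "SEE", "THREE", "SEW"]
trials = build_trials(words)

# Tally how often each word appears in the first serial position.
first_position = Counter(trial[0] for trial in trials)
print(first_position)
```

With eight words, 16 trials, and six positions, each word appears twice in each serial position and no word repeats within a trial. A pure rotation fixes the relative order of neighboring words, so a design that also needs to balance word-to-word transitions would require a different construction.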

The instructions, spoken by the experimenter, informed subjects that they 
were to watch each of the six words presented on a trial, and to write their 
responses when the signal (***) appeared at the end of the trial. They were
told to write down the words in the serial position in which they occurred on
answer sheets that had the serial positions numbered 1-6 for each trial. The
experiment was self-paced, allowing subjects to initiate each trial by a key
press on the computer keyboard. 

Subjects. The subjects were eight normally-hearing college students in
the New Haven area. They were paid for their participation in this 45-min
experiment. None reported any familiarity with signs. 

Results 

Shown in Table 1 are the mean percentages of correct recall for both list 
scoring and item scoring. For comparison purposes, the results of the deaf 
subjects tested by Shand are also given in Table 1. As can be seen from the
Table, the magnitude of the difference in accuracy between the cheremically
similar set and the control set for these hearing subjects is comparable to
that of Shand's deaf subjects. With list scoring, the accuracy was 23% less
for the hearing subjects and 18% less for the deaf subjects. With item
scoring, the accuracy was 8% less for the hearing subjects and 9% less for the
deaf subjects. Analyses on the percentage correctly recalled by the hearing
subjects confirmed that this performance difference between the cheremically
similar lists and the control lists was significant. Analyses of variance
indicated significant main effects of condition for both list scoring,
F(2,14) = 11.79, p < .01, MSe = 204.93, and item scoring, F(2,14) = 13.57,
p < .01, MSe = 34.13. Post hoc tests revealed a significant difference in
accuracy between the cheremically similar and control lists for both scoring
procedures (Newman-Keuls, p < .05).



Table 1

Percentage Accuracy in Recall of Lists of Printed English Words.

                 Lists recalled        Items recalled correctly in
                 perfectly (%)         the correct position (%)

Stimulus Set     Hearing   Deaf(a)     Hearing   Deaf(a)

Control          77        62          91        86
Cheremically
  similar        54*       44*         83*       77*
Phonetically
  similar        H'i*      59          76*       84

Note: (a) from Shand (1982); * significantly different from control



The data of the hearing subjects differed from those of the deaf subjects
in terms of the relative accuracy on the phonetically similar set. Consistent
with Shand's statement regarding his hearing pilot subjects, the hearing
subjects in the present study had difficulty in the recall of words from the
phonetically similar sets. Post hoc tests revealed a significant difference
in accuracy between the phonetically similar and control lists with both list
scoring and item scoring (Newman-Keuls, p < .05). These hearing subjects,
also consistent with Shand's observation regarding his hearing pilot subjects,
were less accurate on the lists from the phonetically similar set than on
lists from the cheremically similar set. This difference was statistically
significant with item scoring (Newman-Keuls, p < .05), but not with list
scoring (Newman-Keuls, p > .05).

Shand reasoned that using four words each from the phonetically and
cheremically similar sets as the words in the control set would allow these
matched items to serve as their own control; the ability to recall these
matched items could be compared when they occurred in lists from the
experimental set vs. when they occurred in lists from the control set,
allowing for a determination as to the relative ability to recall particular
words as a function of list type. The percentages of items correctly recalled
on the four matched words in the control and experimental sets are given in
Table 2. The error pattern on these matched words indicated recall decrements
on both the cheremically similar lists, t(7) = 4.25, p < .01, and the
phonetically similar lists, t(7) = 2.72, p < .03 (both tests two-tailed).
Thus, for these subsets of the stimuli, as for the full set of stimuli, recall
was less accurate when words occurred in experimental lists than when they
occurred in control lists.



Table 2

Percentage Accuracy in Recall of Items that Appeared in Both
the Control Set and an Experimental Set.

                        Experimental Set

                 Phonetically          Cheremically
                 similar (%)           similar (%)

Stimulus Set     Hearing   Deaf(a)     Hearing   Deaf(a)

Control          90        86          92        86
Experimental     77*       85          84*       78*

Note: (a) from Shand (1982); * significantly different from control

Discussion

The results reported here call into question Shand's (1982) conclusion
that the deaf subjects' recall decrement on his cheremically similar lists of
printed words can be taken as evidence of sign-based coding. When the same
word lists were presented here to non-signing hearing subjects, their
performance showed a decrement as well, a finding that, in this case, clearly
cannot be attributed to sign-based coding. Rather, the greater semantic
relatedness, word length, visual similarity, or number of syllables* of the
words in Shand's cheremically similar set than in the control set are likely
to have led to the decrement for the hearing subjects. The same factor(s) may
also have been responsible for the recall decrement for the deaf subjects.

However, the comparable decrement on the cheremically similar lists for
both deaf and hearing subjects does not rule out the possibility of different
underlying causes for these two subject groups. For example, it seems that
Shand's deaf subjects would have been less likely than the present hearing
subjects to have been influenced by the number of syllables in words.
Although some deaf individuals, even native signers, have been found to use
speech-based coding in short-term ordered recall (Conrad, 1979; Hanson, 1982;
Lichtenstein, in press; Shand, 1980), these same studies have found that
certainly not all deaf individuals do. The use of a speech code by
prelingually, profoundly deaf persons appears to be related to a number of
variables, including English proficiency and speech skill (Conrad, 1979;
Hanson et al., 1984; Lichtenstein, in press). Shand's (1982) deaf subjects,
as a group, showed no significant effect due to phonetic similarity; the
present hearing subjects, however, did.

Although the possibility remains that the recall decrement for Shand's 
deaf subjects on the formationally (cheremically) similar lists was due to
sign coding, the comparable results obtained here with hearing subjects 
clearly undercut Shand's argument. His conclusion must also be considered in 
light of the fact that his results are inconsistent with other studies in the 
literature in which deaf college students have served as subjects: Other 
studies using similar procedures (but different word sets) have shown 
performance decrements by deaf signers in the serial recall of formationally 
similar signs (Hanson, 1982; Poizner et al., 1981), but not in the serial 
recall of printed words having formationally similar signs (Hanson, 1982; 
Lichtenstein, in press). Moreover, although sign-based intrusion errors have
been found in the serial recall of unrelated lists of signs (Bellugi et al.,
1975; Krakow & Hanson, 1985), sign-based intrusions have not been found in the
recall of lists of printed words (Krakow & Hanson, 1985). Taken together, 
these facts argue against Shand's conclusion that his results can be taken as 
evidence of sign-based coding of printed words. 

MORPHOPHONOLOGY AND LEXICAL ORGANIZATION IN DEAF READERS*



Vicki L. Hanson and Deborah Wilkenfeld†



Abstract. Prelingually, profoundly deaf individuals, due to their
hearing impairment, would not be expected to have the same access to 
phonological information as hearing individuals. They might 
therefore have difficulties in using phonological structure to 
relate different morphological forms of words. Deaf and hearing 
readers' sensitivity to the morphological structure of English words 
was tested in the present study by using a lexical decision 
(word/nonword classification) task. Target words were primed ten
trials earlier by themselves (e.g., think primed by think), by
morphologically related words (e.g., think primed by thought), or by
orthographically related words (e.g., think primed by thin).
Response times of both hearing and deaf college students to target
words were facilitated when primed by themselves and also when
primed by morphological relatives. Response times of subjects in
neither group were facilitated to targets primed by orthographically
related but morphologically unrelated words. These results indicate
that deaf readers, like hearing readers, are sensitive to underlying 
morphophonological relationships among English words. 



An appreciation of the morphological structure of English words has been
demonstrated experimentally for hearing readers. In reading tasks, it has
been shown that responses to words are facilitated by prior presentation of
morphologically related words (Fowler, Napps, & Feldman, 1985; Murrell &
Morton, 1974; Stanners, Neiser, Hernon, & Hall, 1979). Thus, for example, the
word walk is more readily recognized when a morphologically related word such
as walks, walking, or walked precedes it than when no such morphological
relative does. While facilitative effects due to priming by a semantic
associate (e.g., doctor primed by nurse) generally appear not to persist
beyond immediate testing (Dannenbring & Briand, 1982; Henderson, Wallis, &
Knight, 1984), facilitative effects due to priming by a morphological relative
have been found to persist for lags of at least 48 items (Fowler et al.,
1985). Facilitation due to priming by a morphological relative is plausibly
attributed to a particular organization of the reader's internal lexicon in
which morphologically related words are stored closely together (Fowler et
al., 1985; MacKay, 1978; Stanners et al., 1979; Taft & Forster, 1975).

Not all morphologically related words have a common pronunciation and
spelling of the shared morpheme (e.g., find - found), yet previous research
suggests that hearing readers organize members of such disparate word
"families" together in their lexicons (Fowler et al., 1985; cf. Stanners et
al., 1979). This is apparently due to their ability, as speakers of English,
to make use of the rules of the phonological component of the grammar, which
relate underlying forms that are similar at the abstract level of
morphophonological representation to forms that are different at the more
concrete level of phonetic representation (see Chomsky & Halle, 1968).

The orthographic conventions of English appear to capture similarities at
the morphophonological level and to exploit speakers' knowledge of the rules
that render differences at the phonetic level. Thus, for example, the
orthography represents the vowel in wide in the same way as the vowel in width
(i.e., by the letter i) and represents the vowel in heal in the same way as
the vowel in health (i.e., by the letters ea), reflecting the fact that at the
abstract morphophonological level they are presumably the same. The speaker
of English knows that in the first case the letter i represents the phone [ay]
while in the second case it represents [I]. Note that not all orthographic
similarities reflect morphophonological relationships (for example, cat is not
related to catalyst). The evidence indicates that words are not stored
closely together by virtue of their orthographic similarity alone in the
mental lexicons of hearing readers (Feldman, in press; Murrell & Morton, 1974;
Napps & Fowler, submitted).

It may not be surprising that hearing readers of English are able to make
efficient use of an orthographic system that presupposes a knowledge of the
underlying structure of the language. However, it can reasonably be asked
whether deaf readers are able to make similarly efficient use of this
orthographic system. Since deaf readers do not come to the task of learning
to read English with the same experience with English phonology that hearing
readers do, it is not clear whether they are able to take advantage of
morphophonological relationships captured by the orthography. The present
experiment investigates whether prelingually, profoundly deaf readers are able
to acquire the knowledge of English phonology necessary to perceive the
morphological relationships among written words that are observed in the
orthography.

A technique that has been used to study morphological effects on lexical
organization is that of repetition priming. This technique requires subjects
to make a word/nonword response to each item during continuous presentation of
letter strings. Lexical decision response times thus obtained are typically
faster to the second presentation of a word than to the first (Forbach,
Stanners, & Hochhaus, 1974), and are also typically faster to a word that has
been preceded by a morphological relative (Stanners et al., 1979). This first
type of facilitation will be referred to as identity priming; the second type
as morphological priming. The present study uses the repetition priming
technique to compare response times (RTs) to the first presentation of a word
(e.g., think) with RTs to the same word when it has been preceded ten trials
earlier by itself (think primed by think), by a morphologically related word
(e.g., think primed by thought), and by an orthographically related word
(e.g., think primed by thin). Although RT facilitation of this sort indicative
of lexical effects has not been found when a word is preceded by an
orthographically similar word (Napps & Fowler, submitted), it is possible that
such facilitation will be obtained for deaf readers. In fact, this is exactly
what would be expected if deaf readers fail to perform linguistic analyses of
words, and instead, or in addition, organize words in their lexicons according
to orthographic features.

Two types of morphological relatives are presented in this study:
irregularly inflected forms (e.g., mouse - mice) and derived forms (e.g.,
prove - proof). Word pairs thus related differ phonetically and exhibit less
orthographic overlap than do regularly inflected forms. As a result, readers
can rely on neither phonetic nor orthographic similarity exclusively in
recognizing the morphophonological relationships that hold between the members
of each pair. Access to the underlying morphophonological representations
would be necessary. For hearing readers, significant facilitation of words
preceded by both irregularly related inflections and derivationally related
words has previously been obtained (Fowler et al., 1985).

In the present experiment, the performance of deaf and hearing college
students is compared. It should be borne in mind that the deaf college
students who served as subjects represent the more advanced readers among the
deaf population. These subjects were not tested in order to find out how deaf
readers, in general, read, but rather to determine whether sensitivity to the
underlying morphophonological relationships among words is possible at all in 
the presence of prelingual, profound deafness. A similar pattern of results 
for the hearing and deaf subjects would suggest that subjects in the two 
groups have a similar organization of their mental lexicons. 

Method 

Stimuli 

Word triples were constructed consisting of a target word paired with
both a morphological relative and an orthographically similar but
morphologically unrelated word (e.g., think - thought - thin). The target
words (e.g., think) and their orthographically similar primes (e.g., thin)
always had at least the first three letters in common.

Preliminary lists of these triples were given to four deaf students from
Gallaudet College, who were asked to indicate any words on the list that they
did not know. Final stimulus lists were then constructed that excluded word
triples from the preliminary list having one or more words that were judged to
be unfamiliar.

The final list consisted of 24 word triples. For 14 of these triples, the
morphological relative was an irregularly inflected form. For ten of the word
triples, the morphological relative was a derivationally related form. The
target words were generally high in frequency of occurrence in written
English: 14 had a frequency of at least 100 per million words, 3 had a
frequency of at least 50 per million, and the remaining 7 had a mean frequency
of 27.6 per million (Thorndike & Lorge, 1944). A listing of the stimulus
words is given in the Appendix.






Throughout the full experiment, each target word appeared once in each of
three prime-target conditions. By appearing in each of these conditions, each
target word served as its own control. The three prime-target conditions were
(1) identity prime, in which the target word served as both target and prime
(e.g., think being primed by think); (2) morphological prime, in which the
target word was primed by a morphologically related word (e.g., think being
primed by thought); and (3) orthographic prime, in which the target word was
primed by an orthographically similar word (e.g., think being primed by thin).
Although there was obviously some orthographic overlap between the target
words and their morphological relatives, the orthographic overlap was less
than that between the target words and their orthographic primes. The
morphological primes had 2.13 letters in common with the target words
(considering only common letters in the same word position); the orthographic
primes had 3.42 letters in common with them. This difference was significant,
t(23) = 7.85, p < .001.
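
The overlap measure described above (common letters in the same word position) can be sketched as follows. The function name positional_overlap is illustrative, and the sketch assumes that positions are aligned from the first letter of each word and that letters beyond the shorter word are ignored.

```python
def positional_overlap(target, prime):
    """Count letters that are identical and occupy the same position in both words."""
    return sum(1 for a, b in zip(target.lower(), prime.lower()) if a == b)


# The example triple used in the text: think - thought - thin.
morphological = positional_overlap("think", "thought")  # t, h match
orthographic = positional_overlap("think", "thin")      # t, h, i, n match
print(morphological, orthographic)
```

Averaging this count over all 24 target-prime pairs of each type would give per-condition means of the kind reported above.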

Three experimental test lists were constructed so that in each list every 
target was tested in only one condition. Eight of the target words appeared 
in the identity prime condition, eight in the morphological prime condition, 
and eight in the orthographic prime condition. Each target followed the prime 
by a lag of ten items. In addition, there were twelve filler words per list.
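
The assignment of targets to conditions across the three lists amounts to a Latin-square rotation. The following sketch shows one such rotation under the assumption that condition assignment is rotated by target index; the placeholder target names and the function assign_conditions are hypothetical, and the actual assignment used in the experiment is not specified in the text.

```python
from collections import Counter

CONDITIONS = ("identity", "morphological", "orthographic")


def assign_conditions(targets):
    """Return one {target: condition} mapping per list, rotating conditions
    so that each target meets each condition exactly once across lists."""
    n = len(CONDITIONS)
    return [
        {target: CONDITIONS[(i + list_idx) % n] for i, target in enumerate(targets)}
        for list_idx in range(n)
    ]


# Placeholder names standing in for the 24 target words.
targets = [f"target{i:02d}" for i in range(24)]
lists = assign_conditions(targets)

# Within each list, eight targets fall in each condition.
for assignment in lists:
    print(Counter(assignment.values()))
```

Because every subject sees all three lists, this rotation lets each target word serve as its own control across the three prime-target conditions.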

To serve as an index of episodic (memory) effects, nonwords in the
experiment were generated by replacing the initial consonant or consonant
cluster of each word with another consonant or consonant cluster that made the
letter string a nonword. For example, the nonword counterparts of the word
triple less - least - lesson were dess - deast - desson. In list
construction, the nonwords were treated similarly to their word counterparts,
with each of the target nonwords preceded ten trials earlier by an identity
prime, a morphological nonword prime, or an orthographic nonword prime. The
final lists each contained 120 items, 60 of which were words and 60 nonwords.
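
The nonword construction can be sketched as an onset substitution. This is a minimal illustration that assumes the onset is every letter before the first of a, e, i, o, u; it does not check that the result is actually a nonword, which in the experiment was presumably done by the authors' judgment. The function names are illustrative.

```python
VOWELS = set("aeiou")


def split_onset(word):
    """Split a word into its initial consonant cluster and the remainder."""
    for i, letter in enumerate(word):
        if letter in VOWELS:
            return word[:i], word[i:]
    return word, ""  # no vowel letter found: treat the whole word as onset


def make_nonword(word, new_onset):
    """Replace the initial consonant (cluster) of a word with new_onset."""
    _, remainder = split_onset(word.lower())
    return new_onset + remainder


# The example triple from the text: less - least - lesson.
triple = ["less", "least", "lesson"]
print([make_nonword(word, "d") for word in triple])
```

Using the same replacement onset for all three members of a triple keeps the nonword "relatives" as similar to one another as their word counterparts.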

A practice list of 30 items was constructed. The structure of the list was 
consistent with that of the experimental list. 



Procedure 



Stimulus presentation was controlled by a microcomputer. A trial began 
with the presentation of a warning signal (a "+") that appeared in the center
of a CRT screen for 250 ms. The warning signal was then terminated and,
following a 250 ms blank interval, a stimulus item was presented. Stimulus
items were presented in uppercase letters in the center of the screen until
the subject responded or until 5 seconds had elapsed. RT in milliseconds was
measured from the onset of the letter string. 

Subjects were instructed that they would be seeing strings of letters. 
They were told to indicate as rapidly and as accurately as possible whether or
not each letter string was an actual English word by pressing one of two 
response buttons. If the letter string was a word, they were to press the YES 
response button. If the letter string was not a word, they were to press the 
NO response button. The YES button was pressed with the index finger of a 
subject's right hand, and the NO button with the index finger of the left 
hand. For the deaf subjects, the instructions were signed by a deaf 
experimenter who is a native signer of American Sign Language (ASL). For the 
hearing subjects, the instructions were spoken by a hearing experimenter. All
subjects were individually tested.

Each subject saw all three experimental lists, list order being randomly
drawn from the six possible orderings of the three lists. Thus, all subjects
saw each target word once in each prime-target condition. Prior to testing
with the three experimental lists, subjects were tested on the practice list.

Subjects 

Deaf subjects were 14 students at Gallaudet College. All were
prelingually and profoundly deaf with a hearing loss of 85 dB or greater
(better ear average). All except one had deaf parents. The one subject who
did not have deaf parents reported a family history of deafness (i.e., younger
sibling, cousin). In all cases, then, the etiology of their hearing losses
appears to have been hereditary deafness. The reading level of these subjects
was assessed by means of the comprehension subtest of the Gates-MacGinitie
Reading Tests (1978, Level F, Form 2), which was administered to each subject
after completion of the experiment. The median reading grade equivalent of
these subjects was 9.5 (Range: grade 3.1 to 12.9+).

Hearing subjects were 14 students at Yale University who reported no
history of hearing impairment. The reading test was also given to these
subjects, although in all but one case (a subject whose grade equivalent was
12.2), subjects' scores were so high as to be beyond the range for which the
test was standardized (grade equivalent 12.9+).

Results 

Of interest here are RTs to words as targets in the three prime-target 
conditions compared with RTs to these same words as primes. Table 1 shows the 
means of the median RTs in the four conditions.
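
A "mean of median RTs" is obtained by taking each subject's median RT within a condition and then averaging those medians across subjects. The sketch below illustrates the computation; the data structure, the function name, and all RT values are invented for illustration and are not data from this experiment.

```python
from statistics import mean, median


def mean_of_medians(rts_by_subject):
    """rts_by_subject maps subject -> condition -> list of correct RTs (ms).

    Returns condition -> mean across subjects of each subject's median RT.
    """
    conditions = sorted({c for per_subject in rts_by_subject.values()
                         for c in per_subject})
    return {
        c: mean([median(per_subject[c])
                 for per_subject in rts_by_subject.values()
                 if c in per_subject])
        for c in conditions
    }


# Hypothetical RTs for two subjects in two conditions.
example = {
    "s1": {"prime": [500, 520, 900], "identity": [470, 490, 800]},
    "s2": {"prime": [480, 500, 510], "identity": [450, 470, 480]},
}
summary = mean_of_medians(example)
print(summary)
```

Using the median within each subject limits the influence of occasional very slow responses, such as the 900 ms value in the hypothetical data.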



Table 1

Mean RTs (in ms) to Words as Primes and as Targets in
the Three Experimental Conditions.
The Mean Percentage Errors are Given in Parentheses.

                                Subject Group

Condition                  Deaf          Hearing

Prime                      520 (7.4)     482 (5.4)
Target
  Identity Prime           494 (2.4)     458 (5.4)
  Morphological Prime      505 (1.8)     467 (3.6)
  Orthographic Prime       513 (5.1)     473 (4.2)

The median correct RTs were entered into analyses of variance on the
within-subjects factor of condition (prime, target in the identity prime
condition, target in the morphological prime condition, target in the
orthographic prime condition) and the between-subjects factor of group (deaf,
hearing). There was a significant main effect of condition in both the
subjects, F(3,78) = 8.12, p < .001, and items analyses, F(3,138) = 5.05,
p < .005, as well as when both were simultaneously considered,
F'min(3,214) = 3.11, p < .05. This effect did not significantly interact with
group in either the subjects or the items analyses (both Fs < 1). Post hoc
Tukey (hsd) tests indicated the source of this main effect: RTs to targets
preceded by an identity or morphological prime were significantly faster than
RTs to the same words as primes (p < .05), that is, RTs to targets primed by
themselves and by morphological relatives were facilitated. There was no
significant difference between RTs to targets in the identity prime and
morphological prime conditions (p > .05). RTs to targets preceded by
orthographic primes were not significantly facilitated (p > .05).
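
The combined statistic can be recomputed from the two F ratios. The sketch below assumes that the F'min reported above is the usual min F' statistic, i.e., the product of the subjects and items F ratios divided by their sum; the function name is illustrative.

```python
def min_f_prime(f_subjects, f_items):
    """Combine subjects and items F ratios into min F' = F1 * F2 / (F1 + F2)."""
    return (f_subjects * f_items) / (f_subjects + f_items)


# F ratios for the main effect of condition on RTs, as reported in the text.
value = min_f_prime(8.12, 5.05)
print(round(value, 2))
```

Under that assumption, the subjects F of 8.12 and the items F of 5.05 yield a value that rounds to the reported 3.11.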

The error rates on the target words were low for both groups of subjects,
as shown in Table 1. The analysis of the errors was generally consistent with
the analysis of the RT data. An analysis of variance performed on the
percentage of errors indicated no significant difference in error rates for
the two groups in either the subjects or the items analyses (both Fs < 1).
There was a main effect of condition in both the subjects, F(3,78) = ^.CQ,
p < .005, and the items analyses, F(3,138) = 3.95, p < .01, which approached
significance in the simultaneous consideration of both, F'min(3,211),
.05 < p < .10. Post hoc Tukey (hsd) tests indicated fewer errors to targets
preceded by identity and morphological primes than to the same words as primes
(p < .05). No other differences were statistically significant (all
ps > .05). There was an interaction of condition X group in the subjects
analysis, F(3,78) = 3.19, p < .05, but not in the items analysis,
F(3,138) = 2.22, p > .05. The fact that this interaction was not significant
in the items analysis suggests that the scores of just a few subjects deviated
from the general pattern and that these deviant scores were responsible for
the significant interaction in the subjects analysis. Inspection of the
individual subjects' data supports this hypothesis. The deaf subjects
generally produced fewer errors when the common morpheme had been previously
accessed (i.e., in the identity and morphological priming conditions), while a
few of the hearing subjects broke from this pattern, actually producing more
errors to target words in the identity and morphological priming conditions
than to priming words.

The nonword counterparts of the four conditions indicated that there was 
facilitation of nonwords primed by the identical nonword but not of nonwords 
primed by morphologically related or orthographically similar nonwords for 
either subject group. An analysis of variance on the RTs to the nonword 
conditions revealed a main effect of condition, F(3,78) = 6.80, p < .001, that 
did not interact with group, F(3,78) = 1.97, p > .05. There was no 
significant main effect of subject group, F(1,26) = 3.09, p > .05. The 
analysis of the error data indicated no significant main effects of either 
group, F < 1, or condition, F(3,78) = 1.69, p > .05, and no significant 
interaction of the two variables, F(3,78) = 1.29, p > .05. The means of the 
median RTs (and the mean percentage errors), collapsed across subject group, 
were 575 ms (9.^%), 550 ms (6.3%), 561 ms (9.^%), and 563 ms (7.2%), 
respectively, for the prime, the target in the identity prime condition, the 
target in the morphological prime condition, and the target in the 
orthographic prime condition. Post hoc Tukey (hsd) tests on the RTs indicated 
that the main effect of condition was due to faster RTs to nonwords when they 
served as targets in the identity prime condition than when they served as 
primes (p < .05). There was no significant facilitation of nonword targets in 
the orthographic prime condition (p > .05). Importantly, there also was no 
significant facilitation of nonword targets in the morphological prime 
condition (p > .05), suggesting that the facilitation obtained with words in 
the morphological prime condition was due to lexical, not episodic, effects. 
Moreover, with words in the morphological prime condition there were fewer 
errors to targets than primes, while, by contrast, in the nonword error data 
the percentage of errors did not significantly vary as a function of 
condition. 

Because there is evidence that highly successful hearing readers/spellers 
are more sensitive to morphophonological relationships than are average or 
poor readers/spellers (Fischer, Shankweiler, & Liberman, 1985; Freyd & Baron, 
1982), the question of whether deaf readers' sensitivity to morphophonological 
relationships varies as a function of reading proficiency was also examined. 
Correlations between deaf subjects' degree of priming in each of the target 
conditions and their grade level reading achievement were consistent with the 
notion that the better readers were more sensitive to morphological 
relationships than the poorer readers. Correlations were computed between 
scores on the comprehension subtest of the Gates-MacGinitie Reading Tests 
(1973) and their RT facilitation in the three target conditions. The measure 
of facilitation was the RT to primes minus the RT to targets. The 
correlations with facilitation in the identity prime condition (r = .39) and 
the morphological prime condition (r = .41) approached significance (both 
df = 12, .05 < p < .10, one-tailed). There was no significant correlation 
between reading achievement and amount of facilitation in the orthographic 
prime condition (r = .09). 
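
As a check on how a correlation of this size relates to the stated significance band, the following sketch converts a Pearson r to a t statistic with df = n - 2 and compares it with standard one-tailed critical values of t for 12 df. The function name is hypothetical, and the critical values are taken from standard t tables rather than from the text.

```python
import math


def r_to_t(r, df):
    """Convert a Pearson correlation to a t statistic with df degrees of freedom."""
    return r * math.sqrt(df / (1.0 - r * r))


# Standard one-tailed critical values of t for df = 12 (from t tables).
T_CRIT_P10 = 1.356
T_CRIT_P05 = 1.782

t_identity = r_to_t(0.39, 12)  # identity prime condition, r as reported
print(round(t_identity, 2))
print(T_CRIT_P10 < t_identity < T_CRIT_P05)  # True means .05 < p < .10, one-tailed
```

The facilitation score entering each correlation is simply the RT to primes minus the RT to targets, so a positive r means that better readers showed more priming.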

Discussion 

The results of this experiment indicate that despite prelingual and 
profound hearing impairment, it is possible to acquire a sensitivity to the 
morphophonological structure of English words, even when morphological 
relations are expressed by orthographically dissimilar representations. In 
this experiment, deaf subjects, like hearing subjects, were facilitated in 
their response times to words that had been preceded by a morphological 
relative. Neither hearing nor deaf subjects were facilitated in their 
response times to words that had been preceded by an orthographically similar, 
yet morphologically unrelated, word. 

Several pieces of evidence from the present study suggest that the obtained 
facilitation to target words in the morphological prime condition reflected 
lexical, not episodic influences. Episodic effects could arise in an 
experiment such as the present one because subjects remember seeing or 
responding to a particular letter string previously in the experiment. One 
indication of episodic effects in a repetition priming task is the presence of 
facilitation on nonwords (Feustel, Shiffrin, & Salasoo, 1983). There was, 
however, no such facilitation to nonword targets in the morphological prime 
condition, suggesting that the facilitation to target words in this condition 
can be attributed to lexical effects. Moreover, the fact that there was no 
facilitation to target word RTs in the orthographic prime condition is 
consistent with this interpretation. The number of common letters was greater 
in the orthographic than the morphological prime condition, and this greater 
overlap should lead to larger episodic effects; yet, there was no significant 
facilitation in the orthographic prime condition. The observed facilitation 
due to inflectional and derivational relationships is therefore consistent 
with conceptualizations of lexical organization in which morphologically 
related words are tightly associated (Fowler et al., 1985; Stanners et al., 
1979; Taft & Forster, 1975). 

Although the facilitation due to morphological priming did not differ 
significantly in magnitude from that due to identity priming, a look at Table 
1 indicates that the facilitation is numerically somewhat smaller in the 
morphological prime condition. The greater facilitation in the identity prime 
condition may have been due, at least in part, to episodic influences acting 
in conjunction with lexical effects to facilitate RTs to targets in that 
condition (Feustel et al., 1983; Forster & Davis, 1984; Fowler et al., 1985). 
Evidence supporting this interpretation was obtained in the nonword data where 
significant RT facilitation occurred to target nonwords in the identity prime 
condition. Such an episodic influence may therefore also have affected 
response times to word targets in this condition. 

The outcome of this study suggests that deaf readers, whose knowledge of 
English phonology may not be the same in all respects as that of hearing users 
of the language, do possess and utilize a knowledge of phonology that serves 
them well, at least in certain linguistic situations. Studies by other 
investigators have shown that deaf readers (also college students) are able to 
segment morphologically complex words into their stems and affixes and are 
aware that morphologically related words are semantically related (Hirsh-Pasek 
& Freyd, 1983, 1984; Lichtenstein, in press). The present study extends such 
findings by indicating that deaf readers' lexical organization is affected by 
the morphological composition of words. 

Examination of the response time data in the present study reveals a 
somewhat smaller magnitude of identity and morphological priming facilitation 
than in previous studies (Fowler et al., 1985; Stanners et al., 1979). 
Procedural differences between the present study and earlier ones may account 
for this difference. In the present study, each target word occurred three 
times as a target item, once in each of the three experimental lists. Each 
subject was tested on all three lists. Studies have indicated that effects of 
identity and morphological priming may be apparent over relatively long time 
periods (Fowler et al., 1985; Scarborough, Cortese, & Scarborough, 1977). 
With respect to the present experiment, this suggests that response times to 
target words in the second and third lists tested would be facilitated not 
only by the priming word on that list, but also by prior presentations of the 
same and related words on earlier lists. This does not in any way invalidate 
the results of the present experiment; indeed, the results for the hearing 
subjects are quite consistent with other studies of hearing readers (Fowler et 
al., 1985; Stanners et al., 1979). However, the procedure used here would 
tend to diminish the magnitude of the facilitation effect since the response 
times to target words were averaged over the three lists.^2 Moreover, it is 
known that the magnitude of facilitation is greater for infrequently occurring 
words than for words that occur frequently (Forster & Davis, 1984; Scarborough 
et al., 1977). Nearly all the target words of the present study have a very 
high frequency of occurrence in written English, a result of the necessity to 
obtain words within the vocabulary of all the subjects. This, too, is likely 
to have had the effect of diminishing the magnitude of any identity or 
morphological priming compared with other studies in the literature in which 
less frequently occurring words were used. In any case, even though the 
effects reported here are numerically somewhat smaller than those reported in 
previous studies, they are still statistically significant, in spite of the 
presence of factors that might have masked a larger priming effect. 

As in experiments with hearing subjects both here and earlier (Napps & 
Fowler, submitted), the deaf subjects were not significantly facilitated in 
their response times to targets following an orthographically related prime. 
The implication of this result is that words are not, by virtue of 
orthographic (i.e., visual) similarity alone, closely associated in readers' 
lexicons. This is apparently the case for deaf as well as hearing readers of 
English. Although most morphologically related words do overlap 
orthographically a great deal, the present results suggest that formal 
similarity alone is not sufficient for organizing words together. Rather, 
what is required is a morphological relationship. 

There was some indication in the present study that the degree of deaf 
readers' sensitivity to morphophonological relationships was related to their 
reading ability; specifically, the better readers were more sensitive to this 
level of linguistic structure than were the poorer readers. This finding is 
consistent with results reported for hearing subjects (Feldman, 1984; Freyd & 
Baron, 1982). Freyd and Baron (1982), for example, found that superior 
fifth-grade readers outperformed average eighth-grade readers in their ability 
to decompose morphologically complex words. Thus, the superior readers were 
better able to use the principles of English morphology. Based on such 
findings, it has been argued that skill in using the English orthography 
encompasses an ability to apprehend the morphological structure of words 
(Fischer et al., 1985; Freyd & Baron, 1982). 

In conclusion, it should be noted that this is not the only study in which 
evidence has been obtained that deaf readers are able to acquire some 
aspects of the phonological component of English (see Dodd, 1980; Dodd & 
Hermelin, 1977; Hanson, Shankweiler, & Fischer, 1983). It does, however, 
extend previous work in finding that the organization of the mental lexicons 
of deaf readers is affected by morphological relationships captured at an 
abstract level by the phonological component of the grammar. 

Such findings raise the question of the development of morphophonological 
sensitivity in prelingually, profoundly deaf readers. These readers have 
available to them some knowledge of the spoken language that they have 
acquired through experience with speaking and also lipreading. In addition, 
they have knowledge of word structure that has been acquired through reading 
and fingerspelling. (Fingerspelling is a manual representation of the 
orthography.) Each of these factors may contribute in part to the development 
of morphophonological sensitivity in prelingually, profoundly deaf readers 
(for further discussion, see Hanson, 1986; or Hanson et al., 1983). But it is 
likely that none of these factors can, by itself, account for the degree of 
morphophonological sensitivity observed in this experiment: Knowledge of the 
English sound system obtained without reference to its phonetic aspect is 
necessarily incomplete, and the morphology of English is represented by the 
orthography in a way that assumes prior familiarity with phonology on the part 
of the reader. The relative roles of the above factors in the development of 
the morphophonological sensitivity observed in this experiment remain to be 
determined, along with the nature of the contribution made by the innate 
linguistic abilities of deaf readers. 
Footnotes 

^1Surveys in the United States and Canada have found that prelingually, 
profoundly deaf high school graduates generally only read with a grade 
equivalent of about third grade (Conrad, 1979; Karchmer, Milone, & Wolk, 
1979). Therefore, the subjects of the present study were quite successful 
deaf readers, some of them being quite exceptional. 

^2One way to eliminate any effect due to presentation of target words in 
multiple lists would be to examine only the first list on which each subject 
was tested. However, within each list there were too few instances of each 
prime-target condition to produce reliable averaged response times. Thus, 
only the response times averaged over all three lists can be considered a 
reasonable measure. 
APPENDIX 

Target Words    Morphological Primes    Orthographic Primes 

                Inflection 
less            least                   lesson 
fight           fought                  fig 
more            most                    moral 
freeze          frozen                  free 
catch           caught                  cat 
rang            ring                    ran 
feet            foot                    fee 
teach           taught                  tea 
grind           ground                  grin 
find            found                   fin 
sunk            sink                    sun 
mouse           mice                    mouth 
tooth           teeth                   too 
think           thought                 thin 

                Derivation 
speech          speak                   speed 
length          long                    lend 
voice           vocal                   void 
sale            sell                    salad 
sight           see                     sigh 
die             dead                    diet 
forty           four                    fort 
choice          choose                  choir 
singer          song                    single 
prove           proof                   proverb 

PERCEPTUAL CONSTRAINTS AND PHONOLOGICAL CHANGE: A STUDY OF NASAL VOWEL 
HEIGHT* 



Patrice Speeter Beddor,† Rena Arens Krakow,† and Louis M. Goldstein† 



Abstract. To address the claim that listener misperceptions are a 
source of phonological change in nasal vowel height, the 
phonological, acoustic, and perceptual effects of nasalization on 
vowel height were examined. We show that the acoustic consequences 
of nasal coupling, while consistent with phonological patterns of 
nasal vowel raising and lowering, do not always influence perceived 
vowel height. The perceptual data suggest that nasalization affects 
perceived vowel height only when nasalization is phonetically 
inappropriate (e.g., excessive nasal coupling) or phonologically 
inappropriate (e.g., no conditioning environment in a language 
without distinctive nasal vowels). It is argued that these 
conditions, rather than the inherent inability of the listener to 
distinguish the spectral effects of velic and tongue body gestures, 
lead to perceptual misinterpretations and potentially to sound 
change. 



Phonologists have long supposed that listener misperceptions are a source 
of phonological change (e.g., Durand, 1956; Jonasson, 1971; Ohala, 1981; Paul, 
1890/1970; Sweet, 1888). Listener misperceptions are presumably fostered by 
ambiguities in the acoustic signal with respect to articulation. That is, a 
given acoustic pattern may correspond (more or less closely) to more than one 
vocal tract configuration (e.g., [r] and [R] are spectrally similar, but 
articulatorily very different). If a language learner were to identify the 
articulatory source of an acoustic pattern incorrectly (e.g., if [r] were 
perceived as [R]), then, in attempting to imitate that pattern, the learner 
might produce the incorrectly reconstructed form rather than the original 
articulation. Thus, the similarity of certain segments in the acoustic domain 
could lead to their reinterpretation in the articulatory domain (e.g., [r] 
reproduced as [R]), and hence to sound change (e.g., /r/ > /R/ in German; 
Jonasson, 1971). (See Ohala, 1981 for further discussion.) 

The present study addresses the claim that listener misperceptions are a 
source of phonological change within the domain of nasal vowel height. 
Phonologically, there is substantial synchronic and diachronic evidence of 
raising and lowering of nasal vowels in languages of the world. It has been 
suggested that shifts in nasal vowel height originate with the listener, who 
attributes some of the complex acoustic consequences of nasal coupling to 
changes in tongue height, thereby perceiving nasal vowels as higher or lower 
than their non-nasal counterparts (Chen, 1971; Ohala, 1974; Wright, 1980). As 
we will show below, this explanation for phonological shifts in vowel height 
is acoustically plausible, since some of the spectral consequences of coupling 
the nasal and oral tracts are similar to the effects of certain tongue body 
movements. However, this spectral similarity need not lead to 
perceptual confusion as to the articulatory source (i.e., tongue body versus 
velic gesture) of the spectral pattern. In fact, as we will show, nasal 
coupling does not affect perceived vowel height when nasalization of the vowel 
conforms to the phonetic and phonological structure of the listener's 
language. However, nasalization does influence perceived vowel height under 
certain conditions that are inconsistent with that structure, as when a 
conditioning environment for vowel nasality is absent or nasal coupling is 
excessive. It is argued that these conditions, rather than the inherent 
inability of the listener to distinguish the spectral effects of velic and 
tongue body gestures, lead to perceptual misidentifications and potentially to 
sound change. 

Our goal, then, is to shed some light on the extent to which phonological 
shifts in nasal vowel height can be attributed to listener misperceptions. We 
therefore consider three types of data: phonological (section 1), acoustic 
(section 2), and perceptual (sections 3 and 



1. The Phonological Patterns 



Diachronic and synchronic data from geographically distant and 
genetically unrelated languages indicate widespread phonological effects of 
nasalization on vowel height. For example, in French, synchronic 
morphophonemic alternations attest to historical lowering of high and mid 
vowels and raising of low vowels, as in (1) (where N represents any nasal 
consonant). 



(1) French 

    [iN ~ ɛ̃]    e.g., fine/fin            'thin (fem/masc)' 
    [eN ~ ɛ̃]          plénitude/plein     'fullness/full' 
    [yN ~ œ̃]          une/un              'one (fem/masc)' 
    [øN ~ œ̃]          jeûne/(à) jeun      'fast/fasting' 
    [aN ~ ã]          planer/plan         'to glide/level' 



Phonological studies comparing the height of contextual (allophonic) and 
non-contextual (phonemic or distinctive) nasal vowels to the height of 
corresponding oral vowels have found that, when differences occur, they are 
quite systematic across languages. Cross-language patterns of nasal vowel 
raising and lowering, based on Beddor (1983), are summarized in (2) (see 
Beddor (1983) for references). These patterns reflect synchronic allophonic 
and morphophonemic variation between oral and nasal vowel height in 75 
languages, and are generally consistent with diachronic data and vowel 
inventory data from other cross-language surveys (Bhat, 1975; Foley, 1975; 
Ruhlen, 1978; Schourup, 1973). 
(2) Cross-language patterns of nasal vowel raising and lowering 

a. High (contextual and non-contextual) nasal vowels are lowered 
(e.g., nasalization lowers /i/ and /u/ in Bengali, Ewe, Gadsup, 
Inuit, and Swahili). 

b. Low (contextual and non-contextual) nasal vowels are raised 
(e.g., nasalization raises /a/ in Breton, Haida, Nama, Seneca, 
and Zapotec). 

c. Mid non-contextual nasal vowels are lowered (e.g., distinctive 
nasalization lowers /e/ and /o/ in Maithili, Portuguese, 
Shiriana, and Yuchi; distinctive nasalization lowers /e/ (but 
not /o/) in Hindi, Mixtec, and Kiowa Apache). 

d. Mid back contextual nasal vowels are raised (e.g., /o/ or /ɔ/ is 
raised adjacent to N in Batak, Dutch, and Nama). 

e. A mid front contextual nasal vowel is raised in a language where 
the corresponding back vowel is also raised (e.g., /e/ and /o/ 
are raised adjacent to N in Irish, Basque, and Havyaka Kannada); 
otherwise, mid front contextual nasal vowels are lowered 
(e.g., /e/ is lowered adjacent to N in Armenian, Campa, Fore, 
and Tewa, but /o/ does not shift in these languages). 

These patterns show that the phonological effects of nasalization on 
vowel height involve the interaction of three factors: vowel height, vowel 
context, and vowel backness. Vowel height becomes centralized; that is, 
nasalization lowers high vowels and raises low vowels. Vowel context 
(presence or absence of an adjacent nasal consonant) affects mid vowel height, 
and distinguishes lowering of mid non-contextual nasal vowels from raising of 
mid contextual nasal vowels. Vowel backness also primarily affects mid 
vowels, but a front-back asymmetry holds for all vowels, such that front 
vowels are more likely to be lowered than back vowels. More specifically, 
lowering of a back nasal vowel in a language implies lowering of the 
corresponding front nasal vowel in that language (Beddor, 1983; see also 
Maddieson, 1984). 
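
The interaction of height, context, and backness summarized in (2) can be restated as a small decision procedure. The sketch below is only an illustrative encoding of patterns (2a) through (2e); the function name, argument names, and string labels are hypothetical, and the patterns are cross-language tendencies that describe the direction of a shift when one occurs, not predictions that a shift must occur.

```python
def predicted_shift(height, contextual, back=False, back_mid_raised=False):
    """Direction of nasal vowel height shift according to patterns (2a)-(2e).

    height          -- "high", "mid", or "low"
    contextual      -- True if the vowel is adjacent to a nasal consonant
    back            -- True for back vowels (only matters for mid vowels here)
    back_mid_raised -- True if the language also raises its mid back
                       contextual nasal vowel (needed for pattern 2e)
    """
    if height == "high":
        return "lowered"              # (2a)
    if height == "low":
        return "raised"               # (2b)
    if height != "mid":
        raise ValueError("height must be 'high', 'mid', or 'low'")
    if not contextual:
        return "lowered"              # (2c); /o/ may fail to shift in some languages
    if back:
        return "raised"               # (2d)
    return "raised" if back_mid_raised else "lowered"  # (2e)


print(predicted_shift("high", contextual=True))
print(predicted_shift("mid", contextual=True, back=False, back_mid_raised=True))
```

Writing the patterns this way makes the asymmetry explicit: backness and context matter only for the mid vowels, whereas high and low vowels centralize regardless of either factor.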

2. Acoustic Factors 

The universality (in terms of genetic and geographic diversity) of these 
phonological patterns indicates that raising and lowering of nasal vowels are 
at least partially the result of phonetic constraints. Previously proposed 
phonetic explanations for shifts in nasal vowel height have appealed to 
articulatory (Lightner, 1970; Pandey, 1978; Pope, 1934; Straka, 1955), 
acoustic (Chen, 1971; Ohala, 1974; Wright, 1980), and perceptual (Haudricourt, 
1947; Martinet, 1955; Ohala, 1983; Passy, 1890) constraints. Indeed, a 
comprehensive explanation of the phonological data may well need to recognize 
the interaction of several phonetic, as well as non-phonetic, factors. 
However, we will consider here but a single phonetic factor, the effect of 
nasalization on the first formant region of the vowel spectrum. 

The main effect of vowel nasalization is in the vicinity of the first 
formant. According to acoustic theory of nasalization, coupling of the nasal 
tract to the oral tract adds a pole-zero pair to the low-frequency region of 
the vowel spectrum (Fant, 1960; Fujimura & Lindqvist, 1971; Stevens, Fant, & 
Hawkins, forthcoming). That is, the first formant F1 of the non-nasal vowel 
is replaced in the nasal vowel by a zero FZ and two formants, a shifted oral 
formant F1' and an extra nasal formant FN. FN is almost cancelled by FZ when 
coupling magnitude is small, but becomes more and more prominent as coupling 
increases. F1' typically differs in frequency, and has a wide bandwidth and 
low amplitude, relative to F1 of the uncoupled oral tract (Hawkins & Stevens, 
1985; Stevens & House, 1956; Mrayati, 1975). Some of these spectral 
properties of nasal vowels are illustrated in Figure 1 by the vocal tract 
transfer functions for oral and nasal versions of a high front vowel generated 
on the Haskins Laboratories articulatory synthesizer (described below). As 
velopharyngeal coupling is increased from no coupling for oral [i] (top curve) 
to intermediate coupling (middle curve) and large coupling (bottom curve) for 
nasal [ĩ], the frequency of F1' shifts upwards and FN becomes increasingly 
prominent. 
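
The cancellation of FN by FZ at small coupling can be illustrated numerically. The sketch below is not the articulatory synthesizer used in this study; it is a minimal pole-zero model in which each resonance is a conjugate pole pair, the zero is the reciprocal of such a term, and increasing coupling is represented (as an assumption) by an increasing frequency separation between FN and FZ. All function names, the 100 Hz bandwidth, and the example frequencies are hypothetical.

```python
import math


def resonance(f_hz, centre_hz, bandwidth_hz):
    """Complex gain at f_hz of one conjugate pole pair, normalised to 1 at 0 Hz."""
    s = complex(0.0, 2.0 * math.pi * f_hz)
    pole = complex(-math.pi * bandwidth_hz, 2.0 * math.pi * centre_hz)
    return (pole * pole.conjugate()) / ((s - pole) * (s - pole.conjugate()))


def pair_gain_db(f_hz, fn_hz, fz_hz, bandwidth_hz=100.0):
    """Gain in dB contributed by the extra nasal pole (FN) and zero (FZ) together."""
    ratio = resonance(f_hz, fn_hz, bandwidth_hz) / resonance(f_hz, fz_hz, bandwidth_hz)
    return 20.0 * math.log10(abs(ratio))


def fn_prominence_db(fn_hz, fz_hz):
    """Largest boost produced by the pole-zero pair between 50 and 1000 Hz."""
    return max(pair_gain_db(f, fn_hz, fz_hz) for f in range(50, 1001, 10))


# Assumed (illustrative) values: FN fixed at 300 Hz, FZ moving away from it.
for fz in (300, 320, 400):
    print(fz, round(fn_prominence_db(300, fz), 2))
```

When FN and FZ coincide the pair contributes nothing to the spectrum, and the boost near FN grows as the two separate, which is the qualitative behavior described above.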
Figure 1. Vocal tract transfer functions for three versions of a high front 
vowel generated by articulatory synthesis: with no nasal coupling 
(top curve), with intermediate coupling (middle curve), and with 
large coupling (bottom curve). Nasal coupling shifted F1' upward 
relative to F1 and introduced FN, which showed increased spectral 
prominence with larger coupling. 
formant, that is, F1' > F1 (Fujimura & Lindqvist, 1971; Mrayati, 1975). This 
increase might lead us to expect nasalization to lower perceived vowel height. 
However, this expectation ignores the fact that the upward-shifted F1' is not 
necessarily the first peak in the nasal vowel spectrum. The lowest-frequency 
formant in the nasal vowel is located between F1 of the uncoupled oral tract 
and the lowest resonant frequency of the nasal tract when closed at the 
coupling end (probably 200-400 Hz; Fujimura & Lindqvist, 1971; Stevens et al., 
forthcoming). So when F1 of the oral vowel is relatively high (as in low 
vowels), the first formant of the coupled system is a low-frequency FN, as 
seen for low back [a] and [ã] in Figure 2. In contrast, the first formant of 

Beddor et al.: Perceptual Constraints and Phonological Change 



the high nasal vowel that was shown in Figure 1 is the upwards-shifted 
low-frequency oral formant. It follows that the frequency of the first 
spectral peak is higher in a nasal vowel than in the corresponding oral vowel 
when the vowel is high, but lower when the vowel is low. This is consistent 
with the centralizing effect of nasalization on phonological vowel height 
discussed above and thus provides a tentative acoustic explanation for high 
nasal vowel lowering and low nasal vowel raising (Wright, 1980). 
Figure 2. Transfer functions for oral [a] (top curve) and nasal [ã] (bottom 
curve) generated by articulatory synthesis. Nasal coupling added a 
low-frequency FN and increased the frequency of F1' relative to F1. 

We can use a model of acoustic-articulatory relationships to demonstrate 
how these acoustic factors could lead to a sound change. The ability of a 
listener (or language learner) to reproduce an arbitrary speech sound must 
depend on knowledge that links the acoustic properties to their articulatory 
origins. If such knowledge were always perfect, then there would be no sound 
changes (for this reason, in any case) at all. Thus, the knowledge brought to 
bear by the imitator is in some way imperfect (perhaps due to the inherent 
ambiguities mentioned earlier). As a model of an extreme case of such 
imperfection, let us imagine a listener (imitator) who has no knowledge of 
vowel nasalization at all and who reproduces any vowel as oral. How will such 
a listener reproduce nasal vowels? 

This question can be answered using the equations developed by Ladefoged, 
Harshman, Goldstein, and Rice (1978) for calculating vocal tract shapes from 
formant frequencies. These equations are based entirely on oral vowels. 
Thus, the equations embody the acoustic-articulatory knowledge of a potential 
imitator ignorant of nasal vowels. We used these equations to calculate vocal 
tract shapes from the formants of the oral and heavily nasalized vowels shown 
in Figure 1. For the nasal vowel, the shifted oral formant (F1') was used as 
the lowest formant in the calculation. Figure 3(a) shows the vocal tract 
shape (of the articulatory synthesizer) that was actually used to generate the 
transfer functions in Figure 1 (except that the velar port was open in the 
nasal vowel). Figure 3(b) shows the recovered vocal tract shapes using the 
Ladefoged et al. equations. Ignoring obvious differences in the pharynx (the 
equations do not recover the shape of the lower pharynx), the recovered [i] is 
very much like the original. However, the shape recovered for [ĩ] is 
substantially lower. Thus, lack of knowledge of the effects of nasalization 
results in a high vowel being reproduced as a lower (oral) vowel. It is in 
this fashion that a sound change could develop. Of course, it is unlikely 
that any potential imitator has no knowledge of nasalization; the model simply 
shows the degree of lowering that would be expected in the most extreme case. 
Figure 3. (a) Vocal tract shape of articulatory synthesizer used to compute 
the transfer functions in Figure 1. (b) Vocal tract shapes 
recovered from the formant frequencies of the transfer functions in 
Figure 1 (see text). 

2.2 Center of Gravity 

Although the effects of nasal coupling on the location of the first peak 
in the vowel spectrum are consistent with contraction of the height dimension, 
they do not appear to account for the front-back asymmetry in the phonological 
data. If we extend our acoustic measure of oral and nasal vowels to include 
not only frequency of the first spectral peak, but also frequency and relative 
amplitude of spectral peaks in the low-frequency region, we arrive at a more 
comprehensive explanation of the phonological patterns. Chistovich and her 
colleagues have found that perceived height of oral vowels reflects a "center 
of gravity" determined by the frequency and amplitude of spectral prominences 
in the F1-F2 region (Bedrov, Chistovich, & Sheikin, 1978; Chistovich & 
Lublinskaya, 1979; Chistovich, Sheikin, & Lublinskaya, 1979). Due to the 
complex acoustic effects of nasal coupling, nasalization can cause a shift in 
the center of gravity of the vowel spectrum that need not correspond to a 
parallel shift in the frequency of the first spectral peak. For example, in 
the naturally produced mid front vowels in Figure 4, the frequency of the 
first spectral peak is lower in nasal [ẽ] than in oral [e], but the overall 
effect of the pole-zero-pole combination in the low-frequency region of the 
nasal vowel is to pull up the center of gravity relative to the oral vowel. ^ 




Figure 4. LPC spectra of oral [e] (unfilled) and nasal [ẽ] (filled) produced 
by a Hindi speaker. The nasal spectrum has a lower-frequency first 
peak, but a higher-frequency center of gravity, than the oral 
spectrum. 

Beddor (1983) measured the center of gravity of oral and nasal vowel 
tokens from several languages by calculating the average frequency of the area 
under the spectral envelope in the F1-F2 region. This measure was 
consistently higher for [ĩ ẽ] than for [i e], lower for [æ̃ ã õ] than for 
[æ a o], and roughly the same for [ũ] and [u]. Assuming that an increase in 
center of gravity lowers perceived vowel height and a decrease raises 
perceived height, we would expect nasalization to perceptually lower /i e/, 
raise /æ a o/, and have little effect on /u/. Thus, oral-nasal differences in 
center of gravity are consistent with the front-back asymmetry of the 
phonological data as well as high-low centralization. 
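
A center-of-gravity measure of this general kind can be sketched as an amplitude-weighted mean frequency over a band. The code below is an illustration only: the function name is hypothetical, the choice of linear amplitude as the weight is an assumption (the exact weighting used by Beddor, 1983, is not specified here), and the two toy spectra are invented values chosen to mimic the qualitative pattern described for Figure 4.

```python
def centre_of_gravity(freqs_hz, amplitudes, lo_hz, hi_hz):
    """Amplitude-weighted mean frequency of the spectrum between lo_hz and hi_hz."""
    weighted_sum = 0.0
    total = 0.0
    for f, a in zip(freqs_hz, amplitudes):
        if lo_hz <= f <= hi_hz:
            weighted_sum += f * a
            total += a
    if total == 0.0:
        raise ValueError("no spectral energy in the requested band")
    return weighted_sum / total


# Invented toy spectra (linear amplitude at four frequencies).
freqs = [300, 500, 700, 900]
oral = [1.0, 4.0, 1.0, 0.5]   # single low-frequency peak at 500 Hz
nasal = [3.0, 1.0, 1.0, 3.0]  # lower first peak (300 Hz) plus energy near 900 Hz

print(round(centre_of_gravity(freqs, oral, 200, 1000), 1))
print(round(centre_of_gravity(freqs, nasal, 200, 1000), 1))
```

In this toy case the nasal spectrum has the lower first peak but the higher center of gravity, so the two measures move in opposite directions, as in the mid front vowel example above.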



3. Perceptual Validation 
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We have shown that the effects of nasal coupling on the low-frequency 
region of the vowel spectrum are generally consistent with the phonological 
patterns of nasal vowel raising and lowering. However, the acoustic data 
"explain" the phonological shifts only if the listener is misled by the 
resemblance between spectral changes due to nasal coupling and those due to 
tongue body movements; that is, if the listener has imperfect knowledge of 
acoustic-articulatory relations, as discussed above. Rather than assign all
of the spectral consequences of nasalization to the velic gesture that couples
the oral and nasal tracts, the listener must incorrectly attribute some of 
those spectral effects to a tongue gesture that modifies the oral tract 
configuration. Is there empirical evidence of such misperceptions? In 
answering this question, we hope to shed light not only on the role of 
listener misperceptions, but also on the relevance of context and speaker 
variability to vowel height shifts.

3.1 Perception of Non-Contextual Nasal Vowels

Several studies have investigated the perception of nasal vowel height. 
Wright (1980) produced natural oral and nasal vowels having the same tongue 
configuration, but differing in the position of the velum. All possible 
pairings of oral and nasal vowels were presented to listeners for similarity 
judgments. The perceptual vowel space constructed from listener responses 
showed centralization of nasal vowel height relative to oral vowel height. 
Acoustic analysis of the vowels indicated that this centralization did not 
always correlate with frequency differences in F1' versus F1, but might be
partially due to the extra low-frequency FN in the nasal vowels.

In contrast to Wright's articulatorily matched vowels, Beddor (1984) 
paired oral and nasal vowels generated by formant synthesis. Listeners heard 
vowel sets in which a continuum of oral vowels (varying in the frequency of 
F1) was compared with a nasal vowel standard; they selected the oral vowel in
each set that sounded most similar to the nasal standard. Listeners rarely
chose the oral vowel in which F1 frequency was the same as F1' frequency in
the nasal vowel. In general, listeners' choices were pulled towards FN of the 
nasal vowel: when FN frequency was low, the oral vowel chosen as the "best 
match" had a relatively low F1 frequency; when FN frequency was high, the oral 
match had a high F1. Apparently (as in Wright's study), shifts in the 
spectral center of gravity due to the added nasal formant affected perceived 
vowel quality. 

In a recent study reported in Krakow, Beddor, Goldstein, and Fowler (in 
preparation), we used articulatory synthesis to investigate the effects of 
nasal coupling on perceived vowel height. The Haskins articulatory 
synthesizer allows specification of a mid-sagittal outline of the vocal tract 
by means of the positions of six articulatory parameters: jaw, hyoid, tongue
body center, tongue tip, lips, and velum. The program computes the area 
functions for the specified vocal tract outlines. Speech output is obtained 
after acoustic transfer functions are computed for these area values (see 
Abramson, Nye, Henderson, & Marshall, 1981; Rubin, Baer, & Mermelstein, 1981).

In our study, we focused on the English /e/-/æ/ contrast and generated
seven vowels by systematically lowering and retracting the tongue body, as
shown in Figure 5. These vowel shapes were then embedded in an articulatory
context appropriate for [b_d] and two 7-step continua were generated: oral
[bed-bæd] and nasal [bẽd-bæ̃d]. The use of articulatory synthesis ensured
that the only difference between the continua was that the velopharyngeal port 
was open during the vowel portion of the nasal, but not the oral, stimuli. 
Identification tapes for the two continua consisted of 10 tokens of each 
stimulus arranged in random order. Tapes were played to 12 phonetically
naive native speakers of American English, who labeled the stimuli as bed or 
bad; they had no difficulty identifying the nasal vowel stimuli as such.

Figure 5. Vocal tract outlines of the seven steady-state vowel configurations
specified by lowering and retracting the tongue body in equal
articulatory steps from /e/ to /æ/.

[Figure 6 plot. Horizontal axis: stimulus number.]

Figure 6. Pooled identification functions (n = 12) for the oral [bed-bæd]
(squares) and the nasal [bẽd-bæ̃d] (circles) continua.

The identification functions in Figure 6 show the percent /e/ responses
for both the oral (indicated by the squares) and the nasal (the circles)
stimuli. There were fewer /e/, and therefore more /æ/, responses to the nasal
vowels than to the oral vowels; that is, nasalization lowered perceived vowel
height. This perceptual lowering is consistent with certain acoustic
consequences of coupling the nasal tract to an /e/-like oral tract
configuration. For example, Figure 7 gives the transfer functions for
stimulus 4, which listeners more often labeled /e/ when oral but /æ/ when
nasal. Although FN and F1' of the nasal vowel straddle F1 of the oral vowel,
the predominant peak in the low frequencies of the nasal vowel spectrum is the 
upward-shifted F1'. The identification data can be interpreted as a tendency 
for listeners to associate the frequency shift induced by nasal coupling with 
lowering of the tongue body. 



[Figure 7 plot. Legend: Oral, Nasal. Horizontal axis: frequency (Hz).]

Figure 7. Transfer functions for the steady-state vowel portion of stimulus 4
from the oral (unfilled) and nasal (filled) continua. The first
spectral peak has a lower frequency in the nasal vowel than in the
oral vowel, but the predominant spectral peak in the nasal vowel is
the upward-shifted F1'.

Our perceptual findings, like those of earlier studies, suggest that 
listeners have difficulty assessing the individual contributions of vowel 
quality and nasalization to the spectral shape of the nasal vowel. Listeners 
may have attributed the spectral shifts in part to nasalization, thus leading
to the perception of a nasal vowel differing in height from the corresponding
oral vowel. Alternatively, the spectral shifts may have been attributed
entirely to oral tract shape, leading to the percept of a shifted oral vowel.
Nonetheless, the data clearly show that spectral effects of nasalization on
vowels produced in isolation or in an oral context (i.e., non-contextual vowel 
nasalization) are prone to misinterpretation by American listeners. And yet 
it would be premature to interpret these findings as evidence that listener 
misperceptions are a source of phonological shifts in nasal vowel height.
These listeners may have been prompted to resolve the spectral effects of
non-contextual nasalization in terms of a tongue gesture only because of their
unfamiliarity with distinctive nasal vowels.² That is, such misperceptions
might not occur when the context for nasality is phonologically appropriate
for the listener. To test this possibility, we need to look at the perception
of non-contextual nasal vowels in languages with distinctive vowel
nasalization and also at the perception of contextual nasal vowels (i.e.,
nasal vowels in the immediate context of a nasal consonant) in languages with
anticipatory or perseverative nasalization. Some of our research addresses
the second of these two issues. 

3.2 Perception of Contextual Nasal Vowels

Lowering of the velum for a nasal consonant has been found to begin 
during a preceding vowel to some degree in all languages investigated. 
Substantial anticipatory vowel nasalization has been documented for many
languages, including English (Ali, Gallagher, Goldstein, & Daniloff, 1971;
Clumeck, 1976; Malécot, 1960; Moll, 1962). In Krakow et al. (in preparation),
we tested our English-speaking subjects' perception of not only oral [bed-bæd]
and non-contextual nasal [bẽd-bæ̃d], but also contextual nasal [bẽnd-bæ̃nd].
We speculated that in the [bṼnd] condition, the spectral effects of
nasalization on the vowel might be attributed to an anticipatory velic
lowering gesture for the nasal consonant, thus allowing more accurate
assessment of vowel configuration than in the [bṼd] condition.

Support for our speculation is provided by previous studies in which 
listeners were shown to be sensitive to coarticulatory information. In a 
study of vowel nasality, Kawasaki (1986) reported that perceived nasality of
vowels in [mVm] syllables was enhanced by attenuation of the adjacent nasal 
consonants. Her results suggest that listeners partially "factored out" vowel 
nasalization when the conditioning environment for nasalization was 
perceptually salient. Ohala, Kawasaki, Riordan, and Kaisse (in preparation; 
see also Ohala, 1981) looked at listeners' ability to recognize the 
coarticulatory fronting effects of apical consonants on adjacent /u/. They 
found that vowels ranging from [i] to [u] were more often labeled as back /u/ 
when flanked by apical consonants ([s_t]) than by labial consonants ([f_p]);
that is, listeners apparently discounted some of the frontness of the vowels
in the apical context as due to coarticulatory effects. Other studies have 
suggested that listeners are able to factor out coarticulatory effects not 
only of consonants on vowels, but also of vowels on consonants (e.g., Fowler,
1984; Kunisaki & Fujisaki, 1977; Mann & Repp, 1980; Whalen, 1981) and vowels
on vowels (Fowler, 1981). These data all suggest that knowledge of how 
phonetic units are coproduced influences speech perception. (More specific 
theoretical accounts of such facts have been proposed in Fowler, 1983; 
Liberman & Mattingly, 1985.) We thought that such knowledge might enable 
listeners to distinguish the effects of nasalization from those of tongue 
shape on the spectrum of a contextual nasal vowel. 

In our study, the contextual nasal condition [bẽnd-bæ̃nd] was matched as
closely as possible to the oral and non-contextual nasal conditions described 
above. All vowel stimuli had the tongue shapes shown in Figure 5. The 
contextual nasal continuum was the same as the non-contextual nasal continuum, 
except that the velopharyngeal port in the contextual nasal stimuli was open 
not only during the vowel, but remained open (at 16.8 mm²) for 80 ms of the
137 ms alveolar occlusion, yielding natural-sounding [bVnd] sequences. Since 
the steady-state portions of corresponding contextual and non-contextual nasal 
vowels were identical, we hypothesized that if the perceived height of nasal
vowels were strictly a function of their spectral characteristics, then
listeners would judge vowel height to be the same in the two nasal conditions. 
However, if tacit knowledge of anticipatory nasalization in English enabled 
listeners to factor out the spectral effects of contextual nasalization, then 
perceived height of the contextual nasal vowels would be more, if not exactly, 
like that of the oral vowels. 



Labeling responses to the contextual nasal stimuli were obtained from the 
12 subjects who identified the oral and non-contextual nasal vowels. The 
experimental procedure was the same as described above, except that subjects 
labeled the nasal stimuli as bend or band (as opposed to bed or bad). In 
Figure 8, the identification responses to the contextual nasal [bṼnd] stimuli
(the diamonds) are compared with the [bVd] and [bṼd] functions from Figure 6.
Notice that the point at which subjects shifted from /e/ to /æ/ responses
(i.e., the 50% crossover point) in the [bṼnd] condition was the same as in the
[bVd] condition; that is, contextual nasalization had no effect on perceived
vowel height.

Figure 8. Pooled identification function (n = 12) for the contextual nasal
[bẽnd-bæ̃nd] continuum (diamonds) as compared with the oral
[bed-bæd] and non-contextual nasal [bẽd-bæ̃d] functions (see Figure
6).
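
For readers who want to compare identification functions numerically, the 50% crossover point can be estimated by linear interpolation between the two adjacent stimuli whose percent /e/ responses bracket 50%. The sketch below is one common way to do this and is not necessarily how the crossover was determined in the study. The function name `crossover_50` and the response percentages are invented for illustration and are not the data plotted in Figures 6 and 8.

```python
def crossover_50(stimuli, percent_e):
    """Stimulus value at which percent /e/ responses falls through 50%.

    Uses linear interpolation between adjacent stimuli; returns None if
    the function never crosses 50% from above.
    """
    points = list(zip(stimuli, percent_e))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 >= 50.0 > y1:
            return x0 + (y0 - 50.0) * (x1 - x0) / (y0 - y1)
    return None

stimuli = [1, 2, 3, 4, 5, 6, 7]
# Hypothetical percentages, invented for illustration only.
oral_percent_e = [100, 98, 90, 70, 30, 5, 0]
noncontextual_nasal_percent_e = [100, 95, 75, 40, 10, 2, 0]

oral_crossover = crossover_50(stimuli, oral_percent_e)
nasal_crossover = crossover_50(stimuli, noncontextual_nasal_percent_e)
```

A crossover that falls at a lower stimulus number means fewer /e/ responses overall, that is, perceptual lowering; equal crossovers in two conditions correspond to no effect on perceived height.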



3.3 Discussion 



The perceptual data call into question simplistic accounts of the relation 
between listener misperceptions and nasal vowel height shifts. First, that 
listeners did not misjudge nasal vowel height when provided with a
conditioning environment for vowel nasalization fails to support the idea that
changes in contextual nasal vowel height are due to listeners misinterpreting
the spectral effects of nasalization as cues for vowel height. Secondly,
although the finding that listeners misjudged nasal vowel height in the 
absence of a conditioning environment might appear to support listener 
misinterpretations as a source of non-contextual height shifts, this finding 
must be evaluated in light of the language background of the listeners. Our
explanation for perceptual lowering of the non-contextual nasal vowels is that
our American listeners did not expect a nasal vowel in the context [b_d] and 
consequently perceived the spectral changes introduced with nasal coupling as 
due at least in part to tongue configuration. This reasoning prompts us to 
expect different results if we were to obtain judgments of the non-contextual
nasal vowels from listeners whose native language has distinctive vowel 
nasalization. Since these listeners "expect" nasal vowels to occur in oral 
(as well as nasal) contexts, we hypothesize that non-contextual nasalization
would have less of an effect — or perhaps no effect — on their perception of 
vowel height. 

If listeners can separate the spectral effects of nasal coupling from those
of tongue configuration, how then do we explain phonological raising and
lowering of nasal vowels? We could, of course, turn to articulatory or even
non-phonetic explanations, but the consistent correlations between the
acoustic effects of nasalization and the phonological patterns make us
reluctant to reject an acoustic-perceptual approach. We can maintain that
listener misperceptions lead to shifts in nasal vowel height if we can show 
that normal perceptual processing occasionally fails. Specifically, since 
listeners normally distinguish the acoustic consequences of velic versus 
tongue body gestures, we need to show that this distinction can break down 
under certain conditions. In the next section, we consider two conditions 
that could lead to perceptual confusion and potentially to sound change. 

4. Sources of Perceptual Confusion

4.1 Loss of Conditioning Environment

Ohala (1981, 1983) has argued that many sound changes in which loss of the
conditioning environment co-occurs with the conditioned change can be
explained by the listener's failure to detect the conditioning segment. We
believe a similar argument provides a tentative explanation for shifts in 
non-contextual nasal vowel height. 

In the vast majority of languages that have distinctive nasal vowels, such 
vowels evolved from earlier sequences of phonemic oral vowels followed by 
nasal consonants (Ferguson, 1963) or preceded by nasal consonants (Hyman, 
1972). One account of phonemicization of vowel nasalization with concomitant 
nasal consonant loss is that the perceptual salience of vowel nasality 
increased as the perceptual salience of the conditioning nasal consonant 
decreased (see Kawasaki, forthcoming).³ However, at the transition stage,
distinctive vowel nasalization is not fully integrated into the language. If
listeners do not expect non-contextual nasal vowels but also do not perceive
the now-weakened nasal consonant, then they might attribute the acoustic
effects of vowel nasalization to either (A) nasal coupling, (B) change in
tongue configuration, or (C) both nasal coupling and change in tongue
configuration. Under these conditions, we would expect /VN/ or /NV/ to result
historically in (A) /Ṽ/ with nasalization but no height change, (B) /V′/ with
height change but no nasalization, or (C) /Ṽ′/ with height change and
nasalization.

Language data provide evidence of all three types of phonological change.
There are numerous type A languages in which nasalization has no marked effect 
on vowel height. For example, nasal vowel inventories were reportedly the 
same as oral vowel inventories in 77 of the 155 languages with phonemic nasal 
vowels surveyed by Ruhlen (1978). Possible examples of type B change include 
Greek *en > a (*hekenton > hekaton; Foley, 1975), Colloquial Tamil final e:n >
æ: (Bhat, 1975) and Old Norse, in which i and u lowered when a following
nasal consonant was lost, but the nasality of the lowered vowels is uncertain
(Bhat, 1975). Type C languages are more difficult to identify, since
distinctive nasalization and height shift must be shown to have occurred more 
or less simultaneously. One such language appears to be French. Accounts of 
the evolution of French low non-contextual nasal vowels from non-low vowels
followed by nasal consonants disagree on the relative order of distinctive
nasalization and vowel lowering (compare Entenman, 1977; Haden & Bell, 1964;
Martinet, 1965; Pope, 1934), but the disagreement itself suggests that the two
changes occupied roughly the same time period.

Evidence of type A languages (/VN/ > /Ṽ/) indicates that nasal consonant
loss is not a sufficient condition for phonological shifts in nasal vowel
height. At the same time, the existence of type B (/VN/ > /V′/) and type C
(/VN/ > /Ṽ′/) languages suggests that nasal consonant loss is a possible
trigger for such shifts. These phonological data correspond to our
experimental results with American English speakers showing perceptual height
shifts in [bṼd] sequences, although our results fail to distinguish whether
listeners attributed all (as in type B languages) or only some (as in type C
languages) of the spectral consequences of nasalization to tongue height.

Our claim, then, is that listeners' ability to distinguish the acoustic 
consequences of velic versus tongue body gestures might break down if the
listener encounters a nasal vowel, but neither detects a conditioning nasal 
consonant nor expects non-contextual vowel nasalization. We hypothesize that 
these conditions lead to ambiguity as to the nasality of the vowel. This 
uncertainty could in turn lead to changes in vowel height if the listener were
to resolve at least some of the acoustic effects of nasalization in terms of
tongue configuration. 

We have argued that, as a result of nasal consonant loss, there might be
perceptual ambiguity leading to changes in vowel height. The next section
postulates a second source of listener misperceptions that could influence the
height of not only non-contextual, but also contextual, nasal vowels.

4.2 Variability in Production 

Most of our discussion of nasal vowels has approached vowel nasalization as 
a binary distinction, such that vowels are either nasal or non-nasal. But 
there is considerable variation in degree of vowel nasalization across vowel 
tokens, types, and contexts, as well as across speakers and languages
(Benguerel, Hirose, Sawashima, & Ushijima, 1977; Clumeck, 1976; Henderson,
1984; Ohala, 1971a; Ushijima & Sawashima, 1972). And as we have already seen 
(section 2), different magnitudes of nasal coupling have different effects on 
the vowel spectrum. 

What influence, then, might variability in degree of nasalization have on 
vowel height? Consider, for example, a vowel followed by a nasal consonant.
It seems reasonable to assume that the presence of the nasal consonant gives
rise to certain expectations about the nasality of the vowel. If expectations
are met, listeners should be able to factor the correct amount of vowel
nasalization out of the vowel spectrum. But they might factor out too much
(i.e., overcompensate) if nasalization is unexpectedly weak, or too little
(undercompensate) if nasalization is excessive. (See Ohala, 1981, 1983 for
discussion of the possible role of overcompensation in sound change.) Both 
errors could affect perceived vowel height: overcompensation would reverse 
the direction of the height shifts predicted by acoustic factors, while 
undercompensation would yield the predicted shift. 
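
The logic of over- and undercompensation can be made concrete with a deliberately crude toy model. It is offered only as an illustration of the reasoning and is not a model proposed in this paper. It assumes, hypothetically, that the spectral effect of nasalization grows linearly with port area, that listeners subtract the effect of an expected port area, and that the expected area in a [bṼnd] context equals the natural-sounding 16.8 mm² opening. The function name, the `gain` parameter, and the sign convention (positive means perceptual lowering of a mid vowel such as /e/) are all assumptions.

```python
def predicted_height_shift(actual_port_mm2, expected_port_mm2, gain=1.0):
    """Toy linear model of residual perceived height shift for a mid vowel.

    Positive values stand for perceptual lowering (the acoustically
    predicted direction for /e/); negative values stand for raising.
    """
    acoustic_effect = gain * actual_port_mm2    # shift produced by nasal coupling
    compensation = gain * expected_port_mm2     # shift the listener factors out
    return acoustic_effect - compensation

EXPECTED_IN_NASAL_CONTEXT = 16.8  # mm^2; assumed listener expectation

weak = predicted_height_shift(7.2, EXPECTED_IN_NASAL_CONTEXT)       # overcompensation
matched = predicted_height_shift(16.8, EXPECTED_IN_NASAL_CONTEXT)   # expectations met
strong = predicted_height_shift(24.0, EXPECTED_IN_NASAL_CONTEXT)    # undercompensation
```

Under these assumptions, weaker-than-expected nasalization yields a negative value (raising, the reverse of the acoustic prediction), matched nasalization yields zero, and stronger-than-expected nasalization yields a positive value (lowering). Setting the expected area to zero mimics listeners who expect no nasalization at all, in which case any coupling produces some lowering in this toy model.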

Some of our [bẽnd-bæ̃nd] results address this issue. The data presented in
section 3 were for a moderate (i.e., natural-sounding) amount of
velopharyngeal port opening. But listeners were also tested on contextual and
non-contextual nasal vowel stimuli produced with a small port opening (where
nasalization was judged by the experimenters as perceptually weak) and a large 
port opening (where nasalization was judged as perceptually strong). The 
small port opening should raise the perceived height of the nasal vowels 
relative to the oral vowels if listeners overcompensate for weak nasalization; 
the large port opening should lower nasal vowel height if listeners 
undercompensate for strong nasalization.

Figure 9. Pooled identification functions (n = 12) for the nasal continua
generated with a small velar port opening (a) and a large port
opening (b). The oral function is redrawn for comparison.

Figure 9 gives the identification responses to the contextual [bṼnd] and
non-contextual [bṼd] stimuli with small (7.2 mm²) and large (24.0 mm²) port
openings. In comparison with the oral [bVd] function, the nasal functions in
Figure 9a show that, although weak non-contextual nasalization did not
influence perceived vowel height, weak contextual nasalization slightly raised
perceived height (i.e., there were more /e/ responses to the [bṼnd] stimuli
than to either the [bVd] or the [bṼd] stimuli). Since our American English
listeners presumably expected nasalization in the contextual, but not the
non-contextual, nasal conditions, our data suggest that listeners
overcompensated for unexpectedly weak nasalization. This finding lends
support to the speculation by Ohala (1983) that phonological raising of mid
contextual nasal vowels (as opposed to lowering of mid non-contextual nasal
vowels) might be explained by listener overcompensation for contextual 
nasalization. 

In contrast, Figure 9b shows that strong non-contextual and contextual
nasalization lowered perceived vowel height, with the non-contextual nasal
vowels exhibiting greater lowering (i.e., the [bṼd] stimuli elicited the
fewest /e/ responses and the [bVd] stimuli the most). We interpret these
results as evidence that listeners undercompensated for unexpectedly strong 
nasalization. 

The implication of these findings for sound change is that variability in 
degree of vowel nasalization could cause perceptual uncertainty as to the 
relative contributions of the nasal and oral tracts to the vowel spectrum. 
Both weak and strong nasalization could lead to height shifts because of 
listener failure to correctly assess these contributions. 

5. Further Questions and Conclusion 

Several issues concerning nasal vowel height have not yet been resolved. 
We have not yet studied the perception of nasal vowel height by speakers of a 
language with distinctive vowel nasalization. While we have speculated that 
such listeners would show little or no effect of non-contextual nasalization 
on perceived vowel height, absence of these data clearly limits our 
understanding of listeners' ability to factor out the effects of nasal 
coupling on the vowel spectrum. Unfortunately, this experiment may prove to 
be difficult to do, since many languages with distinctive nasal vowels show 
vowel quality differences between oral and nasal vowels, or phonotactic 
constraints against /CṼC/ sequences, or both (severely limiting the use of
our stimuli for these purposes). 

Another concern is that the timing of the velic gesture, like its magnitude 
(i.e., size of velopharyngeal opening), can differ in speakers' productions of
nasal vowels, depending on the quality of the vowel, the speaker, and the 
language (Clumeck, 1976). We still need to determine how temporal variability 
in the onset of the velic gesture affects the perceived height of nasal 
vowels. However, the present work leads us to conjecture that, for a given
language, there is an "expected" temporal pattern and that deviations from
that pattern (e.g., premature velic lowering) would lead to perceptual
ambiguity and perhaps phonological change in nasal vowel height. 

In summary, we have seen that there are consistent cross-language 
phonological patterns of nasal vowel height defined by the interaction of 
vowel height, context, and backness. We have also seen that a primary 
acoustic consequence of nasalization is the introduction of a pole-zero pair
in the vicinity of F1, the effect of which is to shift the center of gravity
in nasal vowel spectra relative to corresponding oral vowel spectra. These 
center of gravity shifts can account for two important variables in the 
phonological data, vowel height and vowel backness, and therefore provide 
phonetic motivation for most of the phonological patterns if these acoustic 
effects of vowel nasalization affect perceived vowel height. The perceptual
data suggest that listeners misperceive nasal vowel height only when 
nasalization is phonetically inappropriate (e.g., excessive nasal coupling) or 
phonologically inappropriate (e.g., no conditioning environment in a language 
without distinctive nasal vowels). If inappropriate nasalization were unique 
to the laboratory setting, then these perceptual findings would oblige us to 
reject the claim that listener misperceptions are a source of nasal vowel 
height shifts in natural languages. However, even though inappropriate 
nasalization is not the "norm," variations in degree of nasalization and in 
the perceptual salience of the conditioning environment for vowel nasalization 
are normal consequences of speech production and perception and as such are 
the raw material of nasal vowel height shifts. Thus, we are brought, from 
another direction, to recognize the importance of variation in accounting for 
sound change (cf. Weinreich, Labov, & Herzog, 1968). It should be clear,
however, that our acoustic-perceptual account of phonological changes in nasal 
vowel height has been restricted to the initiation of these changes. We have 
not attempted to specify the processes by which listener misperceptions become 
stable phonological patterns. 

We have argued that listener familiarity with a particular phonetic and 
phonological structure leads to certain expectations with respect to vowel
nasalization. Listeners correctly assess the contribution of nasal coupling
to the vowel spectrum when these expectations are met, but when they are not, 
listeners apparently choose tongue configuration as an alternative source of 
the spectral effects of nasal coupling and thereby misperceive nasal vowel 
height. We conclude, then, that a comprehensive explanation of sound change 
in a language must take into account not only the physical (articulatory,
acoustic, or perceptual) origins of the change, but also the phonetic and 
phonological structure of the language, including variability in that 
structure.

Footnotes

¹The F2 differences in [e] and [ẽ] in Figure 4 suggest that the two vowels
may have been produced with different oral tract configurations; thus we
cannot say to what extent the shift in center of gravity is due to
nasalization per se.

²Although English does not typically have distinctive nasal vowels before
voiced stops, an apparent exception to nondistinctive vowel nasalization in
American English occurs before voiceless stops. Malécot (1960) found that
nasal consonants before voiceless stops are of extremely short duration, and
may possibly be absent for some speakers, suggesting the existence of minimal
pairs (e.g., cat versus can't) differing only in vowel nasality.

³For some discussion of factors leading to weakening of nasal consonants,
see Lightner (1970), Ohala (1971b), Schourup (1973), Foley (1975), Entenman
(1977), and Ruhlen (1978).



THE THAI TONAL SPACE*

Arthur S. Abramson†



In the analysis of a tone language, the linguist normally thinks first of 
pitch levels and glides as the probable phonetic basis of phonologically 
relevant tones. This is true even though there may be other features, 
apparently secondary in importance, that go along with pitch. Of course, it 
is well known that in some languages, as in certain dialects of Vietnamese, a 
feature other than pitch may be dominant in one or more of the tones. 

Against the background of earlier auditory (e.g., Haas & Subhanka, 1945)
and instrumental (Bradley, 1911) analysis, Abramson (1962) was apparently the 
first to combine techniques of acoustic analysis and speech synthesis to 
investigate the tones of Central Thai (Siamese) — or, indeed, any tone 
language — both acoustically and perceptually. Since then, of course, other 
such treatments of Asian languages, including Thai, have appeared (e.g.,
Gandour, 1978).

The present study is part of an ongoing exploration (e.g., Abramson, 
1975, 1976) of the Thai tonal "space." This space is taken to be the set of 
articulatory and auditory dimensions by which the speaker is constrained in 
production and perception. The paper makes use of unpublished or reanalyzed 
data obtained in Thailand from time to time at the old Central Institute of 
English Language at Mahidol University, the Faculty of Humanities of 
Ramkhamhaeng University, and the Faculty of Arts of Chulalongkorn University. 
It has three broad goals: to revalidate earlier work on "ideal" contours for 
the tones on isolated monosyllables, to gain some insight into the latitudes 
of shifting levels and glides for the intelligibility of the tones, and to 
take another look (cf. Abramson, 1978) at the typological usefulness of the 
distinction between static and dynamic tones. 

The identifiability of isolated natural Thai tones had been demonstrated
in Abramson (1962) and was reaffirmed with much more extensive testing in
Abramson (1975). These findings were a necessary precursor to the five
experiments with synthetic tones presented in this report. Aside from the
baseline data for all five tones obtained in Experiment 1, the report gives no
serious attention to the falling tone, which will have to be treated in
another paper.



Experiment 1 



The major physical correlate of the psychological feature pitch is 
fundamental frequency (F0), which, for speech, varies with the vibration rate
of the larynx. The speech synthesizer used in Abramson (1962) has long since
gone out of use. For this experiment, and the rest, the Haskins Laboratories
computer-controlled formant synthesizer was used. The syllable specified
segmentally as [kʰa:] was chosen as the carrier for the five tones of Central
Thai, yielding five tonally differentiated words. Each synthetic syllable was
made 450 ms long. The frequencies and amplitudes of three steady-state
formants, simulating resonances of an adult male vocal tract, were made
appropriate for a vowel of the type [a:], with formant transitions that
yielded the percept of an initial dorso-velar stop. Timing of the source
functions was set to produce a voiceless aspirated stop. This was done by
turning on a turbulent source for the first 80 ms of the pattern (Lisker &
Abramson, 1970), followed by a periodic buzz source to simulate glottal
pulsing for the remaining 370 ms; the latter served as the carrier for the F0
contours. A slight upward tapering of the overall amplitude at the beginning 
and a slight downward one at the end made for greater naturalness. 

For Experiment 1, the five F0 contours (Figure 1) found in Abramson 
(1962) to be ideal for the synthesis of the tones were replicated as closely 
as possible with the newer synthesizer and imposed on tokens of the carrier 
syllable. These were played in a number of random orders, over the period of 
a month, to 37 native speakers of Central Thai, who wrote their responses as 
words in Thai script. The results, given in Figure 2, reveal rather robust 
identification functions. The two least satisfactory percepts are the mid and 
low tones, although both contours do achieve 60% identification. The falling, 
high, and rising tones are at least 10% higher. All three of them, including 
the allegedly static high tone, involve much F0 movement. 



Experiment 2 



In this experiment and in the remaining three, simple straight-line 
contours were used for a partial exploration of the tonal space. The 16 
contours prepared for Experiment 2 are shown in Figure 3. These variants all 
start at 106 Hz, the top of the lower third of the voice range, and go to 
endpoints ranging from 90 to 152 Hz in 4-Hz steps. (An accidental exception 
is a 6-Hz step from 106 to 112 Hz.) 
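
The stimulus set just described can be enumerated with a short sketch. The 
function names and the 10-ms sampling interval are illustrative assumptions, 
not properties of the Haskins synthesizer; only the frequencies and the 
370-ms voiced portion are taken from the text. 

```python
def experiment2_endpoints():
    """Endpoint frequencies in Hz: 90 to 152 in 4-Hz steps, except 106 -> 112."""
    lower = list(range(90, 107, 4))    # 90, 94, 98, 102, 106
    upper = list(range(112, 153, 4))   # 112, 116, ..., 152
    return lower + upper


def straight_line_contour(start_hz, end_hz, duration_ms=370, step_ms=10):
    """Sample a linear F0 glide; step_ms is an assumed sampling interval."""
    n = duration_ms // step_ms
    return [start_hz + (end_hz - start_hz) * i / n for i in range(n + 1)]


endpoints = experiment2_endpoints()
# Every Experiment 2 contour starts at 106 Hz.
contours = [straight_line_contour(106, end) for end in endpoints]
print(len(contours), endpoints)
```

The count of 16 contours follows directly from the single 6-Hz step; with 
uniform 4-Hz steps from 90 Hz the series would not pass through 152 Hz. 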

Four hypotheses were put forth: (1) The beginning portion of this 
fanlike array is too low in the voice range for mid-tone responses. (2) The 
falls at the lower part of the array are too low and slow for the falling 
tone. (3) The upper variants rise too slowly for the rising tone. (4) The 
labels used for the set by the subjects should be mainly "low" and "high." 

The responses to the stimuli are given in Figure 4. The first hypothesis 
is weakly confirmed in that the mid tone has a peak, just for the level 
variant at 106 Hz, of only 39%. The second hypothesis is confirmed; the word 
with the falling tone is not used as a label at all. The third hypothesis is 
not well supported, since the highest variant is labeled "rising" 64% of the 
time; however, this peak, with only two variants above 50%, is not very robust 
compared with the high tone, which has seven variants above 50%, and the low 
tone, which has five variants above 50% and a peak at 90%. As for the fourth 
hypothesis, it is true that the major peaks in the figure are for the low and 
high categories, but the labeling function for the rising tone is conspicuous 
too, while the mid tone, reaching a peak of 39%, is at least not negligible. 

Figure 1. F0 contours for the Thai tones of an adult male on long vowels, 
resynthesized from Abramson (1962: Figure 3.6). 

Figure 2. Experiment 1: Identification of the contours of Figure 1 by 37 
subjects. Response categories: "Mid," "Low," "Falling," "High," "Rising." 

Figure 3. Sixteen F0 contours moving from 106 Hz to endpoints ranging from 
90 to 152 Hz. 

Figure 4. Experiment 2: Identification of the contours of Figure 3 by 38 
subjects. 

Experiment 3 

Figure 5 shows the stimuli for this experiment. They are 17 F0 contours 
on tokens of the [kʰa:] carrier syllable. The contours all start at 90 Hz, 
the bottom of the simulated voice range, and go to endpoints ranging, once 
again, from 90 to 152 Hz in 4-Hz steps. (The exception is the first step, 
which is from 90 to 92 Hz.) The original intent had been to make 92 Hz the 
bottom frequency. 

This array was meant to explore four hypotheses: (1) The onsets are too 
low in the voice range to yield the mid tone. (2) The low onsets should give 
a much better rising category than in Experiment 2. (3) There should be no 
high-tone responses. (4) The first two or three contours at the bottom ought 
to be heard mainly as the low tone. 

The results of Experiment 3 are given in Figure 6. With the labeling 
function of the mid tone hovering around 10% over the first half of the 
stimulus array and then dropping to nothing, the first hypothesis is well 
supported. 

The rising-tone category is clearly more robust here than in Experiment 
2, thus confirming the second hypothesis. More abrupt rises to the same 
endpoints produce more convincing tokens of the rising tone. Although the 
labeling function for the high tone is rather poor, with a plateau at about 
40% for four of the stimuli, this result does not bear out the very 
categorical prediction of the third hypothesis. Of course, this should be 
compared with Experiment 2, in which the higher starting point led to a much 
more robust high-tone percept. In agreement with the fourth hypothesis, the 
first few contours are heard predominantly as the low tone; however, the 
greater area under the "low" curve in Experiment 2 (see Figure 4) suggests 
that a slight fall enhances the acceptability of those stimuli. 

Experiment 4 

This time, the full voice range furnishes the set of beginning points and 
the top of the range, the endpoint. Thus, as shown in Figure 7, the 
beginnings of the 16 contours range from 90 to 152 Hz in 4-Hz steps, except 
for a 5-Hz step at the bottom (90 to 95) and a 3-Hz step at the top (149 to 
152). All the contours end at 152 Hz. 

The hypothesis here is that only the high and rising tones should be 
heard. This portion of the tonal space seems utterly unsuitable for any other 
tone. 

In fact, aside from the essentially negligible "low" labels along the 
bottom of the graph in Figure 8, the two categories that emerge are the high 
and rising tones. Interestingly enough, the stronger of the two categories is 
the high tone. Apparently, these less abrupt rises, compared with those of 
Experiment 3 (Figures 5 and 6), bias the response toward the high tone. 

Figure 5. Seventeen contours moving from 90 Hz to endpoints ranging from 
90 to 152 Hz. 

Figure 6. Experiment 3: Identification of the contours of Figure 5 by 38 
subjects. 

Figure 7. Sixteen contours starting at points ranging from 90 to 152 Hz, 
all ending at 152 Hz. 

Figure 8. Experiment 4: Identification of the contours of Figure 7 by 38 
subjects. 

Figure 9. Sixteen level contours ranging from 92 to 152 Hz. 

Figure 10. Experiment 5: Identification of the contours of Figure 9 by 37 
subjects (adapted from Figure 2 in Abramson, 1978). 

Experiment 5 

Here, on tokens of synthetic [kʰa:], there are 16 level contours, ranging 
from 92 to 152 Hz in 4-Hz steps, as seen in Figure 9. These are undoubtedly a 
greater deviation from natural speech than any of the foregoing contours; 
nevertheless, given the frequent assumption of "level" tones in the linguistic 
literature, it was important to see what the perceptual response to such 
stimuli would be. Indeed, the hypothesis expected only static tones, that is, 
the mid, low, and high tones. 

The results, first presented in Abramson (1978), are given in Figure 10. 
Only the mid, low, and high categories appear. There is much overlap, 
resulting in a lower peak for the mid tone than for the other two. 

Conclusion 

This study continues to support the primacy of the fundamental frequency 
of the voice as the carrier of tonal information in Thai, although some 
concomitant features may, in certain contexts, have at least secondary cue 
value. The "ideal" contours found in earlier work (Abramson, 1962; Erickson, 
1974; Gandour, 1975) are still quite acceptable for isolated Thai words. 

The new work has yielded some information on the perceptual latitudes of 
four of the tones. Level contours are fairly good for the static tones. For 
absolute levels to be so identified in citation forms of words in natural 
speech, there must be some auditory accommodation to the speaker's voice range 
(Abramson, 1976; Leather, 1983), as well as to the immediate tonal context. A 
comparison of Figures 8 and 10 does reveal, however, that the high-tone 
percept is improved by F0 movement. (Similar observations were made for the 
mid and low tones in Abramson, 1978.) 

Fairly rapid movements are needed for the dynamic tones. This conclusion 
is supported here only for the rising tone, although data not presented here 
show the same effect for the falling tone. While the dichotomy between static 
and dynamic tones is thus not categorical, it does have some perceptual 
support. 

There is more work to be done on the tonal space for Thai and other 
languages. The present findings seem compatible with the pitch features 
isolated by Gandour (1978) and the emphasis on the importance of the onset 
frequency values of the contours for Thai by Saravari and Imai (1983). Of 
course, in running speech, all this is further complicated by interactions 
between sentence intonation and the tonal space (Abramson & Svastikula, 1983). 

P-CENTERS ARE UNAFFECTED BY PHONETIC CATEGORIZATION* 

André Maurice Cooper,† D. H. Whalen, and Carol Ann Fowler†† 



Abstract. The perceived onset (P-center) of a word typically does 
not correspond to its acoustic onset (Marcus, 1981; Morton, Marcus, & 
Frankish, 1976). Some researchers have suggested that the P-center 
of a word is solely a product of the acoustic characteristics of the 
word, while others have suggested that a word's P-center is 
determined by its phonetic characteristics. The present series of 
experiments pits a continuously varying acoustic parameter against a 
categorical phonetic percept in order to determine whether P-center 
location is sensitive to the phonetic identity of the prevocalic 
segments of a syllable. With a /ša/-/ča/-/ta/ continuum and three 
different /sa/-/sta/ continua, we find that phonetic judgments are 
categorical but P-center judgments are continuous. The results 
demonstrate that P-center location is not determined by the phonetic 
identity of syllable-initial consonants. Nor, however, is it 
determined by the rise time or the amplitude envelope of the signal, 
as Howell (1984) has suggested. Instead, as Morton et al. and 
Marcus recognized, a combination of at least two different parts of 
the signal is at work, namely, the duration of the prevocalic 
consonant or consonants and, to a lesser extent, the duration of the 
syllable rhyme. Whereas the relevant dimension of each component of 
the syllable is duration, acoustically defined, the partitioning of 
the syllable is phonetically motivated. Thus, both the phonetic 
structure of a syllable and the particular acoustic realizations of 
its structure affect the location of the P-center. 

When listeners are presented with sequences of consonant-vowel syllables 
differing in the number or the nature of consonants and with equal intervals 
between their acoustic onsets, they judge the rhythm of the sequences to be 
irregular. Furthermore, when given the opportunity to adjust the relative 
timing of two syllables until they are perceptually isochronous, listeners 
introduce systematic deviations from acoustic onset isochrony (Morton et al., 
1976). These deviations cannot be explained by reference to any obvious 
acoustic events such as the peak intensity of an utterance or the acoustically 
marked onset of the stressed vowel (see Allen, 1972; Morton et al., 1976; 
Rapp, 1971). 

These findings indicate that the event that listeners attend to when 
judging relative timing is opaque to conventional measurement techniques. 



*Perception & Psychophysics, 1986, 39, 187-196. 
†Also Department of Linguistics, Yale University. 
††Also Department of Psychology, Dartmouth University. 

Therefore, Morton et al. (1976, p. 405) do not attempt to locate the event 
absolutely, but propose that listeners base their rhythmicity judgments on a 
word's "psychological moment of occurrence" or its "P-center" (perceptual 
center). According to Marcus (1981), although the P-center of a word cannot 
be determined absolutely, it can be located relative to the timing of other 
speech or nonspeech events. Thus, by hypothesis, in order for two syllables 
presented in continuous alternation to be perceived as isochronous, the 
components of the sequence must have their P-centers at equal intervals. 

Syllables whose initial consonants differ in manner generally have 
different P-center locations (Fowler & Tassinary, 1981). These consonants, in 
turn, have different acoustic characteristics, especially in the duration of 
the signal before the first vocalic pitch period (loosely, "consonant 
duration"). Marcus (1981) found that the location of the P-center is highly 
correlated with initial consonant duration. Specifically, the shorter the 
duration of the consonant, the earlier the P-center with respect to the 
acoustic onset of the syllable (also see Rapp, 1971; Fowler & Tassinary, 
1981). Marcus also found that changes in the duration of segments following 
the initial consonants are associated with a smaller yet significant change in 
the location of the P-center. These two relationships are expressed by the 
equation: 

     P = .65x + .25y + k 

where x is the measured duration of a syllable onset (that is, the part of the 
signal preceding the first oral pitch pulse), y is the duration of the 
syllable rhyme (the vocalic segment and final consonants) and k is an 
arbitrary constant reflecting the fact that the equation predicts the 
relative, rather than the absolute, location of the P-center. Although the 
equation accounts for about 90% of the variability of P-center locations in 
the set of digits one to nine, it does not explain this variability. 

In fact, one issue that the equation leaves in question is whether 
P-center shifts are explained by the phonetic or by the acoustic properties of 
a word. According to Marcus's equation, P-centers have a phonetic basis to 
the extent that the equation predicts that syllable-initial consonants have a 
markedly greater effect on P-center location than do vowels (whether syllable 
initial or nonsyllable initial) or final consonants. However, Marcus also 
shows that durational changes that do not affect the phonetic identity of the 
segments in a word do affect P-center location. Accordingly, the P-center is 
not solely a product of the phonetic identity of a segment. 

Marcus attempted to test the effect of phonetic identity on P-center 
location by pairing the members of a phonetic continuum with several reference 
stimuli and adjusting the relative timing of the stimulus pairs to isochrony. 
The continuum was created by deleting successive portions of the /s/-noise (in 
30-ms decrements) from a naturally produced token of the word /sɛvən/. The 
initial consonant of the continuum stimuli spanned three phonetic categories, 
viz., /s/, /ts/ and /d/ (presumably as judged by Marcus, 1976, himself). 
Marcus found that abrupt shifts in phone categorization across the phonetic 
continuum were accompanied by a continuous change in P-center location across 
the continuum. Marcus's continuum, however, was not sufficiently well 
constructed to address the issue adequately. First, no identification or 
discrimination tests were performed to confirm how the continuum was 
perceived. Second, the steps of Marcus's continuum were so large that the 
initial consonants spanned three phonetic categories (including one, /ts/, 
that is not phonotactically possible in English) within four steps of the 
five-step continuum. 

In the present study, we used a series of phonetic continua to 
investigate the extent to which the phonetic realization of a syllable-initial 
consonant is relevant to P-center location. Our experiments improved upon and 
extended Marcus's experiment in several ways. First, we obtained 
identification and discrimination data to confirm that the continuum stimuli 
were categorically perceived. Second, the acoustic differences between 
neighboring stimuli on the continuum were sufficiently small to enable us to 
address the question of the relationship between phonetic consonantal 
categories and the location of the P-center. Third, all of the phonetic 
categories under investigation, /š/-/č/-/t/ and /s/-/st/, occur 
syllable-initially in English, the native language of the listeners. The 
phonetic categories /s/ and /st/ had already been shown to have P-center 
differences on the order of 26 ms (Fowler & Tassinary, 1981). Finally, we 
manipulated both prevocalic and postvocalic durations, allowing us to test 
Marcus's equation directly. Our results also allowed us to address Howell's 
(1984) recent suggestion that the amplitude envelope of a syllable 
significantly affects the location of its P-center. 
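
Because the duration manipulations are meant to test Marcus's equation, 
P = .65x + .25y + k, it may help to see what the equation predicts for a 
given change in duration. The following is a minimal sketch; the function 
names and the example durations are hypothetical and are not taken from the 
experiments. 

```python
def relative_p_center(onset_ms, rhyme_ms, k=0.0):
    """Marcus's (1981) equation, P = .65x + .25y + k, with k arbitrary."""
    return 0.65 * onset_ms + 0.25 * rhyme_ms + k


def predicted_shift(onset_a, rhyme_a, onset_b, rhyme_b):
    """Predicted P-center difference (b minus a); the constant k cancels."""
    return (relative_p_center(onset_b, rhyme_b)
            - relative_p_center(onset_a, rhyme_a))


# Hypothetical syllables: the onset is shortened by 100 ms, the rhyme is fixed.
shift = predicted_shift(onset_a=200, rhyme_a=300, onset_b=100, rhyme_b=300)
print(round(shift, 1))
```

Only differences between syllables are meaningful here, which is why the 
sketch reports a shift rather than an absolute location. 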

We compared the phonetic categorization of syllable-initial consonants 
and the relative location of P-centers to determine their effects upon each 
other. The test stimuli consisted of a /ša/-/ča/-/ta/ continuum and three 
/sa/-/sta/ continua. The construction of our first continuum was guided by 
the fact that when a sufficient amount of frication is deleted from /ša/, 
listeners hear /ča/ rather than /ša/; as additional frication is deleted, 
listeners eventually hear /ta/. The construction of the remaining continua 
was based on the fact that when a sufficiently long silent gap is introduced 
between the frication and the vocalic segment of the /sa/ syllable, listeners 
hear /sta/ rather than /sa/. 

Repp (1984) identified four criteria that delimit categorical perception: 
there must be (1) an abrupt shift in labeling probabilities somewhere along 
the continuum, (2) a peak in the discrimination function at the category 
boundary, (3) chance or near-chance level discrimination of stimuli within 
categories, and (4) perfect predictability of the discrimination function from 
the identification function. Strict categorical perception, as described 
above, is rarely, if ever, reported in the literature; instead, the actual 
data approximate the ideal more or less well. Repp (1984) emphasizes that 
provided that the other criteria are not severely violated, a peak in the 
discrimination function at the category boundary is the crucial defining 
characteristic of categorical perception. 
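
The fourth criterion, predicting discrimination from identification, can be 
illustrated with a commonly cited two-category rule associated with Liberman, 
Harris, Hoffman, and Griffith (1957): if listeners discriminate only by 
label, the expected proportion correct for a pair is 0.5 * (1 + (p1 - p2)^2), 
where p1 and p2 are the probabilities of assigning each stimulus to the same 
category. This is a sketch under that assumption; the text does not say that 
this rule was applied to the present data, and the identification values 
below are hypothetical. 

```python
def predicted_discrimination(p_first, p_second):
    """Expected proportion correct if only category labels are available."""
    return 0.5 * (1.0 + (p_first - p_second) ** 2)


# Hypothetical probabilities of one label across a seven-step continuum.
identification = [1.0, 1.0, 0.9, 0.5, 0.1, 0.0, 0.0]

# One-step comparisons between neighboring stimuli.
predicted = [predicted_discrimination(a, b)
             for a, b in zip(identification, identification[1:])]
print([round(p, 3) for p in predicted])
```

Under this rule, pairs within a category stay at chance (0.5) and the 
predicted peak falls on the pairs that straddle the labeling boundary. 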

Three relationships between the categorical nature of the continua and 
the P-center function are possible. First, the relative location of the 
P-center could be sensitive only to the phonetic properties of the syllables. 
In this case, listeners' P-center judgments would be predictable from their 
identification functions. For syllables with categorically perceived 
consonants, this would imply negligible within-category P-center shifts, but 
noticeable shifts between categories. Second, the P-center could be sensitive 
to the durations of a syllable's acoustic segments. In this case, listeners' 
P-center judgments would vary monotonically as a function of duration (see 
Marcus, 1981). The third possibility is that P-center judgments would be 
influenced by both phonetic and acoustic properties of the stimuli, resulting 
in both an abrupt shift in the P-center at the category boundary and 
systematic variation within categories. 
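
The three possibilities make different predictions about the shape of the 
P-center function, which can be sketched as follows. The boundary, jump size, 
and slope are hypothetical values chosen only to make the three shapes 
distinguishable; the linear form of the acoustic account is also an 
assumption, since the text requires only monotonic variation. 

```python
def phonetic_only(duration_ms, boundary_ms, jump_ms):
    """Flat within a category, with a step of jump_ms at the boundary."""
    return jump_ms if duration_ms >= boundary_ms else 0.0


def acoustic_only(duration_ms, slope):
    """Varies monotonically (here, linearly) with segment duration."""
    return slope * duration_ms


def combined(duration_ms, boundary_ms, jump_ms, slope):
    """Linear trend within categories plus a step at the boundary."""
    return (acoustic_only(duration_ms, slope)
            + phonetic_only(duration_ms, boundary_ms, jump_ms))


# Hypothetical continuum of segment durations (ms).
for d in [0, 20, 40, 60, 80, 100]:
    print(d, phonetic_only(d, 50, 30.0), acoustic_only(d, 0.5),
          combined(d, 50, 30.0, 0.5))
```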

General Procedures, Experiments 1-4 

Each experiment consisted of three tasks. The first two tasks, a 
forced-choice identification test and an AXB discrimination test, were used to 
determine the extent to which each continuum was categorically perceived. For 
the identification test, multiple repetitions of the stimuli were randomized 
and presented to listeners. For the AXB test, multiple repetitions of all 
pairs of syllables in a continuum differing by one step (Experiment 1) or by 
two steps (Experiments 2-4) were randomized and presented to listeners. 

The final task, the alignment test, was designed to measure the relative 
P-center location of the test stimuli. For the alignment test, each of the 
test stimuli was paired with a reference syllable /ba/ for presentation. (The 
reference syllable was 349 ms in duration.) The syllable pairs were played to 
listeners in a continuous sequence under computer control. The temporal 
position of the second syllable relative to the first was adjustable within a 
window of fixed duration. Initially, on each trial, there was a 50-ms gap 
between the offset of the fixed syllable and the onset of the movable stimulus. 
The listener's task was to adjust the timing of the sequence until it was 
perceived as isochronous. When the listener was satisfied with the 
adjustment, the computer reported the interval between the acoustic onsets of 
the stimuli. Two systems, with minor differences, were used. On one system, 
implemented with a New England Digital computer at Dartmouth College, 
listeners adjusted the second syllable in steps of 15 ms, 5 ms, or 1 ms in 
either direction relative to the fixed syllable by pressing designated keys on 
a computer terminal keyboard. On the other system, implemented with a DEC 
GT40 computer at Haskins Laboratories, the alignments were made by turning a 
knob. The analog output of the knob was digitized to indicate adjustments in 
12.8-ms increments. Since the experimental results obtained from the two 
systems were similar, they were combined. 

Experiment 1 

The purpose of the first experiment was to investigate how the location 
of P-centers might vary across a categorically perceived /ša/-/ča/-/ta/ 
continuum. 

Method 

Stimuli. Using a waveform editor, a 10-step /ša/-/ča/-/ta/ continuum was 
created by deleting 15-ms increments of frication from the acoustic onset of a 
digitized naturally spoken /ša/ syllable. The fricative segment of the 
original /ša/ was 189 ms in duration; the vocalic segment was 266 ms in 
duration. Syllable duration covaried with the duration of /š/ in the 
continuum. To minimize abrupt onsets, an amplitude ramp, linear with sound 
pressure, was applied to the onset of each stimulus. The offset of the ramp 
was fixed at 150 ms into the frication of the original /ša/, while the total 
duration (and steepness) of the ramp varied with the duration of the 
frication. Thus, for one extreme of the continuum, the ramp was applied to 
the initial 150 ms of the original stimulus. For the other extreme, the first 
135 ms of noise was deleted from the frication and a linear taper was applied 
to the initial 15 ms of the frication (see Figure 1). Although the onset 
ramps in our continua do become steeper as the continuum stimuli get shorter, 
this seemed to us an improvement over the procedure of Marcus (1981), who made 
no attempt to avoid abrupt onsets. In his continuum, increasing amounts of 
fricative energy were simply removed from the beginning of the word "seven," 
resulting in abrupt onsets, and hence the perception of /ts/ very early in the 
continuum. 
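
The durations implied by this construction can be tabulated with a short 
sketch. The constants come from the description above; the function and 
field names are illustrative only, and the sketch computes durations rather 
than editing any waveform. 

```python
FRICATION_MS = 189     # frication of the original syllable
VOCALIC_MS = 266       # vocalic segment of the original syllable
RAMP_OFFSET_MS = 150   # ramp ends this far into the original frication
STEP_MS = 15           # frication deleted per continuum step
N_STEPS = 10


def continuum():
    """Return the durations (ms) of each continuum member, longest first."""
    stimuli = []
    for i in range(N_STEPS):
        deleted = STEP_MS * i
        stimuli.append({
            "deleted_ms": deleted,
            "frication_ms": FRICATION_MS - deleted,
            # The ramp offset is fixed, so the ramp shortens as noise is cut.
            "ramp_ms": RAMP_OFFSET_MS - deleted,
            "syllable_ms": FRICATION_MS - deleted + VOCALIC_MS,
        })
    return stimuli


for stimulus in continuum():
    print(stimulus)
```

The last member has 135 ms deleted and a 15-ms ramp, which matches the 
extreme stimulus described in the text. 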



Figure 1. Continuum endpoint stimuli for Experiment 1: unmodified stimulus 
(upper panel), extreme stimulus manipulation (lower panel). 

The identification test consisted of a randomized sequence of 10 
repetitions of each member of the continuum (10 x 10 = 100 trials). The 
categories allowed in the identification test were /t/, /č/, and /š/. A 
one-step AXB discrimination test consisted of a randomized sequence of five 
repetitions of the four versions of each of the nine pairings of the stimuli 
(9 x 4 x 5 = 180 trials). For the alignment test, twenty judgments were 
obtained for each member of the continuum (20 x 10 = 200 trials). 
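
The trial count for the AXB test can be reproduced with a small sketch. It 
assumes that the "four versions" of a pair are the two orders of the outer 
items crossed with the two possible identities of the middle item; the text 
does not spell this out. The function name and the seed parameter are 
illustrative. 

```python
import random


def axb_trials(n_stimuli, step, repetitions, seed=0):
    """Build a randomized AXB trial list for pairs that are `step` apart."""
    pairs = [(i, i + step) for i in range(1, n_stimuli - step + 1)]
    trials = []
    for a, b in pairs:
        # Assumed four versions: X matches the first or the third item,
        # with the outer items in either order.
        versions = [(a, a, b), (a, b, b), (b, b, a), (b, a, a)]
        trials.extend(versions * repetitions)
    random.Random(seed).shuffle(trials)
    return trials


experiment1 = axb_trials(n_stimuli=10, step=1, repetitions=5)
print(len(experiment1))
```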

Subjects. Three subjects participated. Two were naive as to the 
purposes of the experiment and the third was one of the authors (CAF). 

Results and Discussion 

Figure 2 shows the mean identification and discrimination functions for 
the three subjects. The ordinate represents the percent identification and 
the percent of correct discrimination. The abscissa represents the members of 
the continuum. 

Figure 2. Mean identification and discrimination functions (pooled across 
three subjects) for the /ša/-/ča/-/ta/ continuum, for Experiment 1. 

The identification function shows two abrupt shifts in phoneme 
identification. The mean category boundaries (the 50% crossover points of the 
identification function) occur at 115 ms of frication for /ša/-/ča/ for all 
subjects and at [illegible] ms of frication for /ča/-/ta/ for the two subjects 
who reported /ta/'s. The discrimination function shows peaks in discrimination 
near the mean category boundaries, indicating that listeners discriminate 
better between stimuli that straddle a phoneme boundary than between stimuli 
that fall within the same phoneme category. Our data show two departures from 
strict categorical perception that are frequently reported in the literature. 
First, our discrimination peak is slightly offset from the category boundary, 
and second, within-category discrimination is above chance level (see Best, 
Morrongiello, & Robson, 1981; Healy & Repp, 1982; Liberman, Harris, Eimas, 
Lisker, & Bastian, 1961; Liberman, Harris, Hoffman, & Griffith, 1957; 
Liberman, Harris, Kinney, & Lane, 1961). Nevertheless, the pattern of our 
data is similar to the patterns obtained in earlier studies in which 
researchers concluded that perception was categorical. 

The mean results of the three subjects' performance on the alignment test 
are shown in Figure 3. The ordinate represents the displacement from acoustic 
isochrony of the test stimuli relative to the reference syllable in ms (i.e., 
the interval from the acoustic onset of the test stimuli to the acoustic onset 
of the reference syllable minus one half the window size; thus, this measure 
would be zero for stimuli aligned at their acoustic onsets). The abscissa 
represents the duration of the fricative noise in ms. 
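
The displacement measure plotted on the ordinate can be written out as a 
one-line computation. This is a sketch of the definition given above; the 
function name and the example numbers are hypothetical, and the window size 
used in the experiments is not stated in this section. 

```python
def displacement_from_isochrony(onset_interval_ms, window_ms):
    """Onset-to-onset interval minus half the window size.

    A value of zero means the two syllables are isochronous with respect
    to their acoustic onsets.
    """
    return onset_interval_ms - window_ms / 2.0


# Hypothetical setting: a 1000-ms window and a 540-ms onset interval.
print(displacement_from_isochrony(540.0, 1000.0))
```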

Figure 3. Mean P-center alignment function for Experiment 1, pooled across 
three subjects. The solid line represents the regression line and 
the arrows indicate the category boundaries. 

In this experiment, P-center location, as shown by the regression line, 
moves linearly toward stimulus offset as the duration of the noise increases 
across the continuum. The slope of the regression line is .95, with r = .73. 
(The slopes of the P-center regression lines for the individual subjects are 
.95, 1.07, and .84; the correlations are .80, .91, and .57, respectively. 
Each correlation is significant at the .001 level.) Thus, there is 
essentially a 1-ms shift in P-center location for every millisecond of 
frication deleted. 
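
The slope and correlation reported here are ordinary least-squares 
quantities. The sketch below shows how such values are obtained; the 
frication durations follow from the stimulus description, but the 
displacement values are hypothetical and are not the experimental data. 

```python
import math


def linear_fit(x, y):
    """Return (slope, intercept, r) for a least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Frication durations (ms) of the ten continuum members.
frication_ms = [54, 69, 84, 99, 114, 129, 144, 159, 174, 189]
# Hypothetical mean displacements (ms), for illustration only.
displacement_ms = [10, 22, 41, 50, 72, 80, 101, 109, 131, 140]

slope, intercept, r = linear_fit(frication_ms, displacement_ms)
print(round(slope, 2), round(r, 2))
```

A slope near 1 means that each millisecond of frication removed moves the 
P-center by about one millisecond. 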



There is no abrupt shift in P-center location at the category boundaries 
(which are indicated by the arrows). This indicates that the phonetic 
identity of syllable-initial consonantal segments does not affect P-center 
location. That there is no phonetic (or any other) source of nonlinearity in 
the data is revealed by a goodness-of-fit test. This test reveals that the 
first-degree polynomial is the highest one to significantly reduce the 
residual sum of squares. The significant F, F(3,596) = 243.8, p < .001, for 
degree 0 indicates that there is systematic variance unaccounted for at that 
level; the nonsignificant F for degree 1, F(2,596) = .33, n.s., indicates that 
the linear function accounts for as much of the variance as any higher 
polynomial. Thus, the point on the graph just after the category boundary 
does not signal a departure from linearity. 
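
The logic of testing for departures from linearity can be illustrated with a 
standard lack-of-fit F ratio for data with replicates at each x value. This 
is a related textbook procedure, not necessarily the polynomial test the 
authors ran, and its degrees of freedom differ from those reported above. 
The function name and both data sets are hypothetical. 

```python
def lack_of_fit_f(groups):
    """F ratio testing whether replicate means depart from a straight line.

    groups maps each x value to its list of replicate y values.
    Returns (F, df_lack_of_fit, df_pure_error).
    """
    xs = [x for x, vals in groups.items() for _ in vals]
    ys = [v for vals in groups.values() for v in vals]
    n = len(ys)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    ss_pure = 0.0   # scatter of replicates around their own group mean
    ss_lack = 0.0   # distance of group means from the fitted line
    for x, vals in groups.items():
        group_mean = sum(vals) / len(vals)
        ss_pure += sum((v - group_mean) ** 2 for v in vals)
        ss_lack += len(vals) * (group_mean - (intercept + slope * x)) ** 2

    df_lack = len(groups) - 2
    df_pure = n - len(groups)
    return (ss_lack / df_lack) / (ss_pure / df_pure), df_lack, df_pure


# Hypothetical data: group means on a line versus group means on a curve.
linear = {0: [-1.0, 1.0], 10: [9.0, 11.0], 20: [19.0, 21.0], 30: [29.0, 31.0]}
curved = {0: [-1.0, 1.0], 10: [0.0, 2.0], 20: [19.0, 21.0], 30: [59.0, 61.0]}
print(lack_of_fit_f(linear)[0], lack_of_fit_f(curved)[0])
```

A small F means that the straight line leaves no systematic pattern in the 
group means, which is the sense in which the text speaks of no departure 
from linearity. 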

Experiment 2 

The purpose of the second experiment was to replicate and extend the 
results of the first experiment using a /sa/-/sta/ continuum. In contrast to 
Experiment 1, the onset characteristics (including the rise time) of the 
stimuli in this experiment were left unaffected by the experimental 
manipulation. By eliminating the confounding effects of rise time, we are 
able to address Howell's (1984) claim that the amplitude envelope of a 
syllable (manipulated in his experiment by varying the syllable's rise time) 
significantly affects its P-center location. 



Method 



Stimuli. An eleven-step /sa/-/sta/ continuum was created by inserting 
10-ms increments of silence between the frication and the vocalic segment of a 
naturally spoken /sa/ syllable. The frication of the original syllable was 
206 ms in duration and the vocalic segment was 360 ms in duration. The first 
stimulus of the continuum was the original /sa/. The final stimulus contained 
a 100-ms gap between the frication and the vocalic segment (see Figure 4). 



Figure 4. Continuum endpoint stimuli for Experiment 2: unmodified stimulus 
(/sa/ endpoint, upper panel), extreme stimulus manipulation (/sta/ endpoint, 
lower panel). 

The identification test consisted of a randomized sequence of 20 
repetitions of each member of the continuum (20 x 11 = 220 trials). A 
two-step AXB discrimination test consisted of a randomized sequence of four 
repetitions of the four versions of each of the nine pairings of the stimuli, 
within the AXB paradigm (9 x 4 x 4 = 144 trials). A two-step discrimination 
test was used rather than a one-step discrimination test (as in Experiment 1) 
because we found 10-ms differences between stimuli to be too small for 
listeners to discriminate consistently. (The stimuli in the one-step 
comparisons in Experiment 1 differed by 15 ms.) For the alignment test, 12 
judgments were obtained for each member of the continuum (12 x 11 = 132 
trials). 

Subjects. Three subjects participated in the experiment. One subject was 
naive as to the purposes of the experiment. The other two subjects were two 
of the authors (AMC and CAF). 

Results and Discussion 

Figure 5 shows the mean identification and discrimination results for the 
three subjects. The ordinate represents the percentage of /sta/ responses for 
the identification data and percent correct for the discrimination data. The 
abscissa represents gap duration. The identification function shows an abrupt 
shift in phoneme identification. The mean category boundary occurs at about 
60 ms of silence. The discrimination function shows a peak in discrimination 
near the mean category boundary and troughs in performance within categories. 
(The high discriminability of the first stimulus, when compared to the third, 
the point enclosed in parentheses, may have occurred because listeners were 
able to distinguish the unmodified stimulus from a stimulus that had been 
modified). 
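
The category boundary referred to here is the 50% crossover of the 
identification function. One simple way to estimate it is by linear 
interpolation between the two stimuli that bracket 50%. The interpolation 
method is an assumption, since the text does not state how the crossover was 
computed, and the percentages below are hypothetical. 

```python
def crossover_50(x, percent):
    """Interpolate the x value at which the function crosses 50%."""
    points = list(zip(x, percent))
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        # A crossing lies in this interval if the values bracket 50%.
        if (p0 - 50) * (p1 - 50) <= 0 and p0 != p1:
            return x0 + (50 - p0) * (x1 - x0) / (p1 - p0)
    return None


gap_ms = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# Hypothetical percentages of /sta/ responses, for illustration only.
percent_sta = [0, 0, 0, 5, 10, 25, 40, 80, 95, 100, 100]

print(crossover_50(gap_ms, percent_sta))
```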

The mean results of the three subjects' performance on the alignment test 
are shown in Figure 6. The ordinate represents the displacement from acoustic 
isochrony of the test stimuli relative to the reference syllable in 
milliseconds. The abscissa represents the amount of silence inserted into the 
test stimuli in milliseconds. 

In this experiment P-center location, as shown by the regression line,
moves linearly toward stimulus offset across the continuum with a slope of
1.00, r = .75. (The slopes of the P-center regression lines for the
individual subjects are 1.03, .94, and 1.04; the correlations are .82, .71, and
.77. Each correlation is significant at the .001 level.) Thus, as in
Experiment 1, there is a 1-ms shift in P-center location for every millisecond
of change in gap duration. The phonetic identity of syllable-initial
consonantal segments does not affect the P-center; that is, there is no abrupt
shift in P-center location at the category boundary (which is indicated by the
arrow). A goodness-of-fit test provided an outcome analogous to that of the
test performed on the data from Experiment 1. Thus, there is no significant
departure from linearity in the data.
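
To make the slope statistic concrete, the following sketch fits an ordinary least-squares line to P-center displacement as a function of gap duration. The data are idealized, noise-free values constructed purely for illustration (they are not the measurements reported here), and the helper name `fit_line` is hypothetical. A fitted slope of 1.0 corresponds to the 1-ms shift per millisecond of gap described above.

```python
import math


def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation
    return slope, intercept, r


# Hypothetical, idealized data: the P-center moves 1 ms later per 1 ms of gap.
gap_ms = [float(g) for g in range(0, 101, 10)]
displacement_ms = [200.0 + g for g in gap_ms]  # the 200-ms baseline is arbitrary

slope, intercept, r = fit_line(gap_ms, displacement_ms)
print(round(slope, 2), round(intercept, 1), round(r, 2))
```

With real alignment judgments the points scatter around the line, so r falls below 1 even when the slope remains close to 1.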

The results of both Experiments 1 and 2 demonstrate a
millisecond-for-millisecond shift in the P-center for each stimulus
manipulation. Contrary to Howell's (1984) suggestion, in this experiment the
P-center varied even though the rise time of the test stimuli remained
constant. We have yet to determine, however, what aspect of the changing
durational pattern accounts for our results. In particular, we want to know
whether P-center shifts are a function of changes in gap size, of changes in
overall stimulus duration, of changes in the duration of the prevocalic
segment of the syllable, or of changes in the temporal location of the
acoustically defined vowel: all of these change linearly and at the same rate
in Experiments 1 and 2. The remaining experiments are designed to distinguish
among these alternatives.

Figure 5. Mean identification and discrimination functions (pooled across
three subjects) for the /sa/-/sta/ continuum, for Experiment 2.

Figure 6. Mean P-center alignment function for Experiment 2, pooled across
three subjects. The solid line represents the regression line and
the arrow indicates the category boundary. (Panel title: P-center function,
Exp. 2; abscissa: gap duration in msec.)

Experiment 3

Experiment 3 was designed to control for possible influences of variation
in overall syllable duration on the P-center results of Experiment 2.
Stimulus duration was held constant by excising an equal amount of the
frication to offset the amount of silence inserted. If the shift in P-center
noted in Experiment 2 was primarily related to gap duration, then the P-center
functions of Experiments 2 and 3 should be equivalent. If, however, the
observed shift in P-center is due either to prevocalic duration or to total
duration of the syllable, then the P-center location should not change across
the continuum.

Methods 

Stimuli. The extreme stimuli of Experiment 3 are presented in Figure 7.
The center waveform shows the unmodified /sa/ used in Experiments 2-4. In
the final member of the continuum, represented by the upper waveform, 100 ms
of silence has been inserted between the frication and the vocalic segment and
100 ms of frication has been deleted. For each stimulus in which silence was
inserted, a compensatory amount of frication was excised beginning at a point
72.2 ms into the frication. This location was chosen because it allowed us
both to maintain the original onset and offset characteristics of the
frication and to excise a substantial amount of noise from within the
fricative segment. The identification, discrimination, and alignment tests
were organized as in Experiment 2.



Figure 7. Continuum endpoint stimuli for Experiments 3 and 4: unmodified
/sa/ stimulus (center panel), extreme /sta/ stimulus manipulation for
Experiment 3 (upper panel), and extreme /sta/ stimulus manipulation for
Experiment 4 (lower panel).






Cooper et al.: Effect of Phonetic Categorization on P-centers 



Subjects. The subjects were those of Experiment 2.

Results and Discussion

The identification data showed an abrupt shift in perceived phoneme
category at 55 ms, and the discrimination data showed that discrimination is
somewhat better near the category boundaries than within categories (Figure
8).

The alignment test for Experiment 3 (Figure 9) shows no significant change
in P-center location across the continuum. The slope of the regression line
is -0.003; the correlation is -0.0014. (The slopes of the P-center regression
lines for the individual subjects are -0.13, 0.09, and 0.02; the correlations
are -0.14, 0.18, and 0.03, respectively.) This result shows that the linear
shift in the P-center noted in Experiment 2 is not due to the increases in gap
duration, per se. Nor, consistent with Experiments 1 and 2, is P-center
location affected by the phonetic identity of syllable-initial prevocalic
segments. And finally, this experiment also shows that P-center shifts are
not necessarily affected by a syllable's amplitude envelope (as suggested by
Howell, 1984), since the envelopes of the stimuli varied although the P-center
did not.

The canceling of the effect of the manipulation of gap duration on P-center
location may be ascribed to the canceling of its effect on total syllable
duration, to the canceling of its effect on the duration of the prevocalic
consonant cluster, or to the canceling of its effect on the onset time of the
vocalic segment. Experiment 4 is designed to distinguish among these
alternatives.

Experiment 4

In Experiment 4, syllable duration was held constant, as in Experiment 3,
but here compensation for increases in gap duration was achieved by
shortening the vocalic segment of the syllable rather than the frication. If
the variation in duration of the prevocalic segment or the onset time of the
vocalic segment (acoustically defined) is principally responsible for the
P-center shifts noted in Experiment 2, then the P-center functions of
Experiments 2 and 4 should be equivalent. If total duration is responsible
for the shifts in P-center, then the P-center should remain constant across
the continuum (as in Experiment 3).

Method 

Stimuli. The stimulus manipulations in Experiment 4 are illustrated in
Figure 7. The center waveform represents the unmodified /sa/ used in
Experiments 2-4. The lower waveform represents the final stimulus used in
Experiment 4, in which 12 pitch periods were excised and 93 ms of silence were
inserted. In the continuum, syllable duration was held constant by excising
successive pitch pulses from the vocalic segment and then inserting
compensatory amounts of silence. By excising complete pitch pulses from the
vowel rather than excising an arbitrary amount of the vocalic segment, we
avoided abrupt discontinuities in the periodic part of the signal. Pitch
pulses were extracted beginning 67.7 ms into the vocalic segment. This
location was both within the steady-state portion of the vowel and beyond the
peak intensity of the vowel. The individual pitch pulses ranged from 7.5 to
8.0 ms in duration. The identification, discrimination, and alignment tests
were organized as in Experiment 2.
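
The figures in this paragraph can be cross-checked with simple arithmetic: excising 12 pitch periods of 7.5 to 8.0 ms each removes between 90 and 96 ms of vowel, a range that brackets the 93 ms of silence inserted in the final stimulus. The sketch below is only this consistency check. It assumes that the inserted silence exactly equals the excised vowel duration, and the variable names are illustrative.

```python
# Values taken from the stimulus description above.
n_periods = 12
period_min_ms, period_max_ms = 7.5, 8.0
silence_ms = 93.0

excised_min_ms = n_periods * period_min_ms  # shortest possible excision
excised_max_ms = n_periods * period_max_ms  # longest possible excision

# Mean pitch period implied if the silence exactly compensates the excision.
mean_period_ms = silence_ms / n_periods

print(excised_min_ms, excised_max_ms, mean_period_ms)
```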
Figure 8. Mean identification and discrimination functions (pooled across
three subjects) for the /sa/-/sta/ continuum, for Experiment 3.

Figure 9. Mean P-center alignment function for Experiment 3, pooled across
three subjects. The solid line represents the regression line and
the arrow indicates the category boundary.

Subjects. The subjects were those of Experiments 2 and 3.

Results and Discussion

The identification data showed an abrupt shift in perceived phoneme
category at 47 ms, and the discrimination data showed that discrimination was
better near the category boundaries than within categories (Figure 10).

The results of the alignment test for Experiment 4 (Figure 11) show a
linear shift in the P-center toward stimulus offset. The arrow indicates the
category boundary. The slope of the regression line is .83 with r = .68. (The
slopes of the P-center regression lines for the individual subjects are .80,
.80, and .88; the correlations are .88, .58, and .72, respectively. Each
correlation is significant at the .001 level.) A dashed line with a slope of
one is shown for comparison. A goodness-of-fit test reveals that the function
is linear with no significant departure from linearity.

In this experiment, the P-center shifts toward stimulus offset, but the
change is less than 1 ms of shift in the P-center for each millisecond of
change in the gap duration. A paired t test comparing the slopes of the
subjects' regression lines of Experiment 2 with those of Experiment 4 shows
that these slopes are significantly different, t(2) = 6.47, p < .05. For the
most extreme stimulus in Experiment 4, a gap size of 93 ms results in a
P-center shift of 74.5 ms, whereas, in Experiment 2, a comparable gap size of
90 ms shifts the P-center 91.4 ms, a difference of about 17 ms.
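
The paired comparison can be reproduced from the individual-subject slopes reported for Experiments 2 (1.03, .94, 1.04) and 4 (.80, .80, .88). The sketch below assumes that the slopes are listed in the same subject order in both experiments, which the text does not state explicitly. It uses 4.303 as the two-tailed .05 critical value of t with 2 degrees of freedom, taken from standard tables.

```python
import math

# Individual-subject regression slopes as reported in the text.
# Assumption: subjects are listed in the same order for both experiments.
slopes_exp2 = [1.03, 0.94, 1.04]
slopes_exp4 = [0.80, 0.80, 0.88]

diffs = [a - b for a, b in zip(slopes_exp2, slopes_exp4)]
n = len(diffs)
mean_diff = sum(diffs) / n
var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)  # sample variance
t_stat = mean_diff / math.sqrt(var_diff / n)

T_CRIT_05_DF2 = 4.303  # two-tailed critical value for df = 2, alpha = .05
print(f"t({n - 1}) = {t_stat:.2f}, exceeds critical value: {t_stat > T_CRIT_05_DF2}")

# Difference between the P-center shifts at the most extreme comparable gaps.
shift_difference_ms = 91.4 - 74.5
print(f"difference in P-center shift: {shift_difference_ms:.1f} ms")
```

Under the stated ordering assumption, the computed statistic agrees with the reported t(2) to two decimal places, and the difference in shift is 16.9 ms, which the text rounds to about 17 ms.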

Our results are in agreement with those of Marcus (1981) who showed that 
P-center location is determined by the temporal makeup of the entire stimulus. 
We can interpret the results of Experiment 4 as follows. Increases in the
duration of the prevocalic segment cause the P-center to shift toward stimulus 
offset, as it did in Experiments 1 and 2. Compensatory decreases in vowel 
duration, however, shift the P-center toward stimulus onset, although the 
magnitude of the shift is less. This interpretation is supported by Marcus 
(1981), who has shown that as the vocalic segment in CV syllables decreases in 
duration, the P-center shifts toward stimulus onset. 

The results of Experiment 4 also show that P-center location cannot be
simply associated with acoustically defined vowel onsets. For if the onset of
the vowel were correlated with the P-center results in the previous
experiments, then there should have been a millisecond shift in P-center
location for every millisecond that the vowel onset shifted in the present
experiment. Our results, however, clearly show that the correlation between
P-center location and vowel onsets is not perfect and that P-centers cannot be
linked exclusively to any single acoustically defined event in the speech 
signal. Other investigators have also shown that P-center location seems to 
be correlated with something other than a word's acoustically defined vowel 
onset (Allen, 1972; Fowler & Tassinary, 1981; Rapp, 1971). In Allen's study, 
subjects were required to tap "on the beat" of a specified syllable in a 
sentence. In Fowler and Tassinary's study, subjects were asked to produce
rhyming nonsense syllables in time to a metronome. In Rapp's study, subjects
were also asked to produce nonsense syllables in time with a regularly
occurring pulse. In each of the studies, the pulse or tap both preceded the
acoustic onset of the vowel and was positively correlated with the duration of
the initial consonant or consonant cluster.

Figure 10. Mean identification and discrimination functions (pooled across
three subjects) for the /sa/-/sta/ continuum, for Experiment 4.

Figure 11. Mean P-center alignment function for Experiment 4, pooled across
three subjects. The solid line represents the regression line,
the broken line has a slope of 1 for comparison, and the arrow
indicates the category boundary.

Summary and General Conclusion

Our results demonstrate that a listener's judgments of the P-center of a
syllable are not affected by the phonetic identity of the prevocalic segment 
or by any obvious acoustic properties of the signal, such as gap duration or 
simply the overall duration of the stimulus. Instead, the P-center appears to 
have been determined by a combination of at least two different aspects of the 
signal, the duration of the prevocalic segment and, to a lesser extent, the 
duration of the vocalic segment. 
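
One way to picture this conclusion is as a weighted sum of the two durations. The sketch below is an illustrative linear model, not one proposed in the text. It assumes that the P-center shift equals `w_pre` times the change in prevocalic duration plus `w_voc` times the change in vocalic duration, with `w_pre = 1.0` taken from the Experiment 2 slope and `w_voc = 0.17` chosen so that the Experiment 4 slope of .83 is reproduced. The function and parameter names are hypothetical.

```python
def predicted_shift(d_prevocalic_ms, d_vocalic_ms, w_pre=1.0, w_voc=0.17):
    """Predicted P-center shift (ms, toward offset) under an assumed additive model."""
    return w_pre * d_prevocalic_ms + w_voc * d_vocalic_ms


gap = 100.0  # ms of silence inserted at a continuum endpoint (illustrative)

# Experiment 2: silence added, nothing removed, so the prevocalic segment grows.
exp2 = predicted_shift(gap, 0.0)
# Experiment 3: silence added and equal frication removed, so no net change.
exp3 = predicted_shift(gap - gap, 0.0)
# Experiment 4: silence added and an equal amount of vowel removed.
exp4 = predicted_shift(gap, -gap)

for name, shift in [("Exp 2", exp2), ("Exp 3", exp3), ("Exp 4", exp4)]:
    print(f"{name}: predicted slope = {shift / gap:.2f}")
```

The model is deliberately minimal: it treats the prevocalic segment (frication plus silence) as a single duration and ignores any dependence on where within the syllable a change occurs.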

Our results also bear upon rationales for P-center shifts based on the
amplitude envelope of speech stimuli. Howell (1984) performed experiments in
which the onset intensity envelope of a /ga/ syllable was altered. He claimed
that this manipulation was a "sufficient" source of variation in P-center
location. Furthermore, he suggested that all of Marcus's manipulations
(Marcus, 1981) in which the P-center was altered could be attributed to an
alteration of the distribution of energy in the amplitude envelope. Although
temporal judgments of nonspeech stimuli are influenced by their rise time
characteristics (Howell, 1984; Vos & Rasch, 1981), this explanation does not
account for P-center shifts in speech stimuli. For example, when the onset
envelope of the test stimuli remained unaltered (Experiments 2 and 4), the
P-center location varied; in contrast, when the onset envelopes of the test
stimuli were altered (Experiment 3), the P-center did not vary in location.
Consistent with this finding, Tuller and Fowler (1980) radically changed the
amplitude envelope of speech syllables by infinite peak clipping and found no
shift in the P-center. Finally, Marcus (1981) found that increases in the
silent interval for the dental stop in the word "eight" shifted the P-center
toward the acoustic offset of the word, but that increases in the amplitude of
the release burst did not affect the location of the P-center.

Although our findings show that the phonetic identity of syllable-initial
consonants does not affect the location of the P-center, they do not rule out
any possible effect of the phonetic structure of a syllable on P-center
location. Given that P-center location shows a precise
millisecond-for-millisecond relationship with the duration of prevocalic
segments in a syllable, but is affected to a markedly smaller extent by the
duration of the vocalic segment, the results of Experiment 3 are open to both
phonetic and acoustic interpretations. One interpretation is that this
diminished durational effect on P-center location occurs abruptly just at the
point in a syllable where the listener's perception of prevocalic segments
gives way to perception of the vowel. We would identify this effect as
phonetic and, accordingly, we would predict that manipulations of vowel
duration in vowel-initial syllables would have effects on P-center location
equivalent to the vocalic effects in Experiment 4. An alternative explanation
is that the effects of changes in duration are weaker the farther away they
are from the syllable onset. We would identify such an effect as acoustic,
and would expect the effects of durational manipulations of vowels to be
greater in vowel-initial syllables than in those with initial consonants. We
are currently investigating whether the phonetic quality of syllable
constituents as well as the serial position of those constituents within a
syllable affect the location of the P-center.

TWO CHEERS FOR DIRECT REALISM*

Michael Studdert-Kennedy†



"Beware Procrustes, bearing Occam's razor." 

— Lise Menn 

I am very much in sympathy with Fowler's approach (henceforth, CAF) 
because it is grounded in a functionalist, biological view of language. No 
doubt the approach will be faulted, despite its disclaimers, for narrowly 
focussing on phonetic structure. Yet what is new in CAF is precisely its 
scope: the range of phonetic fact for which it takes responsibility. Basic 
research in speech perception and basic research in speech production (no less 
than applied research in speech synthesis and machine recognition) have tended 
to follow parallel lines. Perceptual research typically manipulates acoustic 
variables with little regard for articulatory constraints, while production 
research typically studies the actions of individual muscles or articulators 
with little concern for how they are coordinated to yield a perceptually 
coherent acoustic signal. By adopting a single abstract unit (corresponding 
to the phoneme-sized phonetic segment) as the presumed functional element of 
both production and perception, CAF lays the ground for a program of research 
responsible to both. Nor is it coincidence that the selected unit is 
potentially alphabetic. For CAF thus acknowledges that our accounts of 
speaking and listening must be consistent with the facts of writing and 
reading. 

A signal virtue of CAF, then, is that it accepts responsibility for the 
segmental structure of all four modes of language action: like any good 
theory, it proposes to unify (eventually) related classes of fact that are 
commonly treated as separate. The faults of CAF largely stem, I believe, from 
a somewhat too zealous attempt to impose a framework, devised to handle an
animal's traffic with the physical world, on a communication system with a 
quite different evolutionary history and function. 

CAF includes three assumptions that need to be modified or, at least,
explicated: (1) perception is "unmediated by cognitive processes of
inferencing or hypothesis testing"; (2) listeners "extract information about
articulation from the acoustic speech signal"; (3) "it matters little through
what sense we realize what speech event has occurred." My comments follow.

*Journal of Phonetics, 1986, 14, 99-104. Commentary on Fowler, C. A.: An
event approach to the study of speech perception from a direct-realist
perspective. Journal of Phonetics, 1986, 14, 3-28. Also SR-85, this volume.
thanks to the Spencer Foundation for financial support, and to Björn Lindblom
and Peter MacNeilage for discussion and comments.

Unmediated Perception 

A corollary of this assumption seems to be that the phonetic segment
should not be constructed, in either perception or production, from smaller
units. Accordingly CAF, invoking speech error data to support the choice of
unit, implicitly dismisses "feature" errors as unimportant. Yet such errors
do occur, with some low frequency, and have to be accounted for. Voicing
metathesis seems to be the most common (e.g., clear blue sky → glear plue sky
(Fromkin, 1971)), but place metathesis also occurs (e.g., pedestrian →
tebestrian (Fromkin, 1971); wild goose chase → wild juice case (Robert Remez,
personal communication)). These errors are interesting because they reflect a
level of organization below the segment.

The possibility of such errors is implicit in CAF's definition of a 
phonetic segment as a "set of coordinated gestures." Elsewhere, Fowler and her 
colleagues (Fowler, Rubin, Remez, & Turvey, 1980) treat the phonetic segment 
as a set of nested, or embedded, coordinative structures that arise as 
functional groupings of muscles, marshalled for moment-to-moment control of
speech. The coordinative structures of Fowler et al. evidently correspond to 
the gestures of CAF. Similarly, Kelso, Saltzman, and Tuller (1986) discuss 
the task-specific grouping of muscles to execute a gesture, nested within the 
CV syllable. CAF, quite properly in my view, regards these gestures as 
non-linguistic (or non-phonetic): "lip closure per se is not an articulatory 
speech event." Lip closure only becomes phonetic (i.e., only performs a 
linguistic function) by virtue of its coordination with other non-phonetic 
gestures in an appropriate linguistic context. 

A speaker, then, is engaged in moment-to-moment marshalling of
intrinsically functionless muscle systems to fulfill a phonetic function, much
as a tennis player marshalls muscles to execute a tennis stroke. A skilled
speaker has a repertoire of routinized processes that assemble non-phonetic
gestures into phonetic segments. Errors in gestural assemblage may then be
rare because the process occurs with very high frequency, so that a given
gesture is called into a phonetic segment even more frequently than a phonetic
segment is called into a syllable. Errors in the process may also be rare due
to tight anatomical and physiological constraints on gestural coupling:
Voicing metathesis is perhaps the most common error because voicing is
relatively loosely coupled to supralaryngeal action. In any event, by this
account, a gestural error is motoric, a segmental error phonetic.

Consider now the child learning to speak. Its task is to discover how to 
marshall its repertoire of non-linguistic babbling gestures for linguistic 
use. Its first linguistic (functionally communicative) segments are words or 
formulaic phrases. The child evidently perceives these units as constructed 
from non-linguistic gestures. For example, Ferguson and Farwell (1975) report 
the following attempts by a 15-month-old child to say the word pen:

r --e v~ , dn , m, h ^h ^h ^ h ^h N ^ -w, 
[ma , A, de hin, do, p in, t nt nt n, ba , ^ au , bua] . 

In these attempts, we find all the gestures required to utter pen: lip
closure, lingua-alveolar closure, tongue raising and fronting, velum raising
and lowering, glottal narrowing and spreading. The gestures are misordered
and mistimed, but it is evident that the acoustic structure of the word did
specify for the child the gestures that compose it.

As the child develops, it will come to recognize recurrent gestural 
groupings as functional elements in speaking: the phonetic segment will 
emerge as the interface between non-linguistic gesture and linguistic word. 
Will the child thereby lose its capacity to perceive gestures? It would seem 
not. The speech error data demonstrate that the adult may produce gestures 
separately from the segmental structure in which they are normally embedded. 
If the perspectives of speakers and listeners are "interchangeable," as CAF
proposes, listeners must assemble segments from non-phonetic, auditory markers 
in the signal no less than speakers assemble them from a non-phonetic gestural 
repertoire. This may not call for "inferencing or hypothesis testing" in 
perception, but it does call for some process less immediate than the word 
"direct" would seem to imply. 

Extracting Information about Articulation 

Direct realism presses CAF into "defining speech event interchangeably 
from the perspectives of talkers and listeners." For the definition to hold we 
must assume that the problem of functional equivalence among diverse motor 
patterns, in general, or of the many-to-one relation between articulation and 
acoustics, in particular, has been solved (cf. Kelso, et al., 1986/this 
volume). We could then be confident that articulation and acoustics are, at 
some abstract level of description, fully isomorphic: to every acoustic 
pattern of change in frequency and time there exactly corresponds an 
articulatory pattern of movement in space and time, and vice versa. 

Ironically, this assumption renders ambiguous much of the evidence cited 
to support it. To show that listeners extract information about articulation 
from the speech signal, CAF cites several studies in which listeners' 
perceptual judgments seemed to be in better agreement with the articulatory 
pattern than with the acoustic. Such findings are anomalous, if articulatory
and acoustic patterns are isomorphic. For the "P-center" studies CAF resolves 
the anomaly by arguing that it arose from an error in the conventional 
acoustic measurements of vowel onset. Once the error was corrected, 
acoustics, articulation, and perception fell into line. 

An equivalent move in the /slit/-/split/ "trading relations" phenomenon
would require systematic measurement of the articulatory correlates of 
acoustic silent interval (stop closure) and formant transitions (stop 
release). Such measurements have never been reported, so far as I know, and 
in the cited study they could not be appropriately made because the 
experiments were done with synthetic speech. Articulatory equivalence (or 
non-equivalence) was therefore inferred, with some circularity, from 
perceptual equivalence (or non-equivalence). However, if the appropriate 
measurements were done on natural speech, articulation, acoustics, and 
perception would, by the hypothesis of CAF, again fall into line. 

In short, if acoustics and articulation are fully isomorphic, they are
merely notational variants. Whether we describe the listener as perceiving
sound patterns or as perceiving articulatory patterns is then a matter of
theoretical taste. Direct perception of articulation becomes merely an axiom
of a direct-realist theory.

Perhaps all this is sophistry. We know, after all, that listeners do
extract information about articulation. How otherwise would every normal
child come to speak the dialect of its peers? We know too from studies of
"lip-reading" that acoustic and optic information about speech may combine in
perception. These studies suggest that we are able to imitate or repeat the
utterance of another because perception extracts an amodal pattern of
information, isomorphic with the pattern that controls articulation, just as
CAF claims. What seems to be at issue then is not whether listeners can
extract information about articulation, but whether they always do, and
whether perception is direct, in the sense that the medium structured by
articulation is transparent and a matter of indifference to the perceiver.



The Medium of Amodality 



Each species of animal has a unique combination of perceptual and motoric
capacities. Characteristic motor systems have evolved for locomotion,
predation, consumption, and mating. Matching perceptual systems have evolved
to guide the animal in these activities. The selection pressures shaping each
species' perceptuomotor capacities have come, in the first instance, from
physical properties of the world.

By contrast, these perceptuomotor capacities themselves must have played 
a crucial role in shaping the form of a social species' communication system. 
The general point was made by Huxley (1914) when he remarked that the
elaborate courtship rituals of the great crested grebe must have evolved by 
selection of perceptually salient patterns from the bird's repertoire of 
motorically possible actions. Certainly, specialized neuroanatomical 
signaling devices have often evolved, but they have typically done so by 
modifying pre-existing structures just enough for them to perform their new 
function without appreciable loss of their old. The cricket stridulates with 
its wings, the grasshopper with its legs; birds and mammals vocalize with 
their eating and breathing apparatus. The quality and range of possible 
signals is thus limited by the structure and function of the co-opted 
mechanism.

A further constraint on signal form must come from the perceptual system
to which the signals are addressed. Here again specialized devices (e.g.,
feature detecting systems, templates) have certainly evolved, presumably by
some minimal modification of a pre-existing perceptual system. Typically,
such specialized devices, in the auditory realm, seem to have evolved in
animals with little or no parental care and therefore little opportunity to
learn their species' call: bullfrogs, treefrogs, certain species of bird, and
so on. We have no evidence for such devices in the human.

We are not then surprised that the main speech frequencies are spread 
over the three octaves (500-4000 Hz) to which the human auditory system is
most sensitive, and that (as the quality of deaf speech attests) speech sounds 
have evolved to be heard, not seen. Thus, the differences in degree of 
constriction among high vowels, intra-oral fricatives, and stops are highly 
salient auditorily; but the same differences in, say, finger to thumb 
distance, would be scarcely detectable if they were incorporated in a visual 
sign language. Similarly, the abrupt acoustic changes at the onset of many CV 
syllables may have been favored, in part, because the mammalian auditory 
system is particularly sensitive to such discontinuities (Delgutte, 1982; 
Kiang, 1980; Stevens, 1981). The resulting auditory contrast perhaps
facilitates the listener's perceptual segmentation both of the syllable from
its context and of the consonant from its following vowel.

On the other hand, the signs of American Sign Language have evolved (over 
the past 170 years) to be seen, not heard. Accordingly, signs formed at the 
center of the signing space (that is, in the foveal region of the viewer) tend 
to use smaller movements and smaller handshape contrasts than signs formed at 
the periphery (Siple, 1978). 

In short, even if the sense that informs us about our environment 
"matters little" in the farmyard (itself a dubious claim), it seems not to 
"matter little" for communication. Language has evolved within the 
constraints of pre-existing perceptual and motor systems. We surrender much 
of our power to understand that evolution, if we disregard the properties of 
those systems. And indeed, CAF concedes as much by citing with approval
Lindblom's work on the emergence of phonetic structure. The success of that
work, particularly for vowel systems, rests on an acoustic description of 
speech sounds, weighted according to a model of the auditory system, and on 
the use of an auditory distance metric to assess their perceptual 
distinctiveness.

How, then, are we to square the auditory properties of the speech signal
with the evident amodality of the speech percept? We must, I think, question
CAF's definition of speech events as "a talker's phonetically structured
articulations." A speech event is not simply articulation, however structured,
any more than a tennis serve is simply the server's swing. A speech event,
even narrowly conceived as phonetic action, only occurs when a speaker
executes, and a listener apprehends, a phonetic function. Elsewhere, Fowler
(1980) has termed this function the talker's phonetic "intent" (cf. Liberman,
1982). "Intent" seems to correspond, at least in level of abstraction, to 
task (or goal), the level at which Kelso et al. (1986/this volume) define a 
single function from which different, but equivalent, articulations may arise. 
Surely, this too must be the level — free of adventitious articulatory 
variation and its acoustic consequences — at which the listener's percept might 
properly be termed amodal. 

Looked at in this way, articulation becomes as much a medium of speech, 
structured by the talker's goals, as the acoustic signal, structured by the 
talker's articulations, and as its heard counterpart, structured by the 
listener's (suitably "attuned") auditory system. Each medium is then subject 
to its own characteristic type of variability. 

One happy side-effect of setting speaker and listener (articulation and 
audition) ou equal footing is that we can rationalize perceptual error more 
simply than does CAF. The likelihood of an error is a function of its cost. 
Collisions between swallows, swarming in hundreds through a cloud of insects, 
or between pelicans flocking and diving into a school of fish, are rare 
(though, pace direct realism, they do occur!). Natural selection prunes the 
error-prone from the species, honing the perceptuanotor systems of the 
survivors to a fine precision. By contrast, errors in speaking and listening 
carry essentially no penalty. Moreover, if phonetic form has been shaped by 
compromise between the articulatory capacities of a speaker and the perceptual 
capacities of a listener, we might expect some instability in phonetic 
execution, some slight oscillation between the opacity comfortable for a 
speaker, the transparency called for by a listener (cf. Slobin, 1980). We may 
view a conversation as a microcosm of evolution: the speaker balances a 
desire to be understood against articulatory ease, the listener a desire to 
understand against the costs of attention (Lindblom, 1983). Given these 
conflicting demands and the modest penalties for error, we might even be 
surprised that errors are not more frequent than they are. In this regard, 
while no one, so far as I know, has studied the social contexts in which 
perceptual errors occur, they are probably rare when the speaker is, say, 
delivering instructions for a parachute jump. 

In conclusion, the fact that we hear speech is no less important and no 
more accidental than the fact that we articulate it. Many of the 
long-standing problems of speech research, including normalization, 
segmentation, and even the lack of invariance, may be illuminated by an 
understanding of audition. Even if the information we extract is amodal, just 
what information we extract and the precision with which we extract it depend 
on our auditory sensitivity. 

AN EVENT APPROACH TO THE STUDY OF SPEECH PERCEPTION FROM A DIRECT-REALIST 
PERSPECTIVE* 



Carol A. Fowler† 



1. Introduction 

There is, as yet, no developed event approach to a theory of speech 
perception and, accordingly, no body of research designed from that 
theoretical perspective. I will offer my view as to the form that the theory 
will take, citing relevant research findings where they are available. The 
theory places constraints on a theory of speech production, too. Therefore, I 
will also have something to say about how talkers must talk for an event 
approach to be tenable. I will begin by defining the domain of the theory as 
I will consider it here. 

An ecological event is an occurrence in the environment defined with 
respect to potential participants in it. Like most ecological events 
(henceforth, events), one in which linguistic communication takes place is 
highly structured and complex. Accordingly, it can be decomposed for study in 
many different ways. One way in which it is almost invariably decomposed by 
psycholinguists and linguists is into the linguistic utterance itself on the 
one hand and everything else on the other. In ordinary settings in which 
communication takes place, this is almost certainly not a natural partitioning 
because it leaves out several aspects of the setting that contribute 
interactively with the linguistic utterance itself to the communication. 
These include the talker's gestures (McNeill, 1985), aspects of the 
environment that allow the talker to point rather than to refer verbally, and 
the audience whose shared experiences with the talker affect his or her 
speaking style. The consequences of making this cut have not been worked out, 
but, at least for purposes of studying language as communication, they may be 
substantial (cf. Beattie, 1983). For the present, however, I will preserve 
the partitioning and one within that as well. 

The linguist Hockett (1960) points out that languages have "duality of 
patterning" -- that is, they have words organized grammatically into sentences, 
and phonetic segments organized phonotactically into words. Both levels are 
essential to the communicative power of language. 

Grammatical organization of words into sentences gives linguistic 
utterances two kinds of power. First, the communicative content of an 
utterance is superadditive with respect to the contents of the words composing 
the sentences taken as individuals. Second, talkers can produce novel 
utterances that the audience has not heard before; yet the utterance can 
convey the talkers' message to the audience. I will refer to a linguistic 
utterance at this level of description as a "linguistic event," and, having 
defined it, I will have little else to say about it until the final section of 
the paper. 

The second structural tier, in which phonetic segments constitute words, 
supports an indefinitely large lexicon. Were each word to consist of a 
holistic articulatory gesture rather than a phonotactically-organized sequence 
of phonetic segments, our lexicons would be severely limited in size. Indeed, 
recent simulations by Lindblom (Lindblom, MacNeilage, & Studdert-Kennedy, 
1983) show that, as the size of the lexicon is increased (under certain 
constraints on how new word labels are selected), phonetic structure emerges 
almost inevitably from a lexicon consisting initially of holistic closing and 
opening gestures of the vocal tract. These simulations may show how and why 
phonetic structure emerged in the evolution of spoken language and how and why 
it emerges in ontogeny. 

I will refer to a talker's phonetically-structured articulations as 
"speech events." It is the perception of these events that constitutes the 
major topic of the paper. A speech event may also be defined as a linguistic 
utterance having phonetic structure as perceived by a listener. In defining a 
speech event interchangeably from the perspectives of talkers and listeners, I 
am making the claim, following others (e.g., Shaw, Turvey, & Mace, 1982) that 
a theory of event perception will adopt a "direct realist" stance. According 
to Shaw et al.: 

Some form of realism must be captured in any theory that claims to 
be a theory of perception. To do otherwise would render impossible 
an explanation of the practical success of perceptually guided 
activity. (p. 159) 

That is, to explain the success of perceptually-guided activity, perception is 
assumed to recover events in the real world. For this to be possible 
consistently (see Shaw & Bransford, 1977), perception must be direct, and in 
particular unmediated by cognitive processes of inferencing or hypothesis 
testing, which introduce the possibility of error. 

By focusing largely on speech events, I will be discussing speech at a 
level at which it consists of phonetically-structured syllables, but not 
necessarily of grammatical, meaningful utterances. It is ironic, perhaps, 
that a presentation at a conference on event perception should focus on a 
linguistic level that is not transparently significant ecologically. However, 
speech events can be defended as natural partitionings of linguistic 
events -- that is, they can be defended as ecological events -- and there is 
important work to be done by event theorists even here. 

The defense is that talkers produce phonetically-structured speech, 
listeners perceive it as such, and they use the phonetic structure they 
perceive to guide their subsequent behavior. Talkers reveal that they produce 
phonetically-structured words when they make speech errors. Most submorphemic 
errors are misorderings or substitutions of single phonetic segments (e.g., 
Shattuck-Hufnagel, 1983). For their part, listeners can be shown to extract 
phonetic structure from a speech communication at least in certain 
experimental settings. That they extract it generally, however, is suggested 
by the observation that they use phonetic variation to mark their 
identification with a social group or to adjust their speaking style to the 
conversational setting. Of course, infant perceivers must recover phonetic 
structure if they are to become talkers who make segmental speech errors. 

This defense is not intended to suggest that the study of perception of 
speech events is primary or privileged in any sense. It is only to defend it 
as one of the partitionings of an event involving linguistic communication 
that is perceived and used by listeners; it is therefore an event in its own 
right and requires explanation by a theory of perception. 

I will discuss an event approach to phonetic perception in the next three 
major sections of the paper. The first two sections consider direct 
perception, first of local, short-term events, and next of longer ones. The 
third section considers some affordances of phonetically-structured speech. 

Although there is lots of work to be done at this more fine-grained of 
the dual levels of structure in language, there are also great challenges to 
an event theory offered by language considered as syntactically-structured 
words that convey a message to a listener. I will discuss just two of these 
challenges briefly at the end of the paper and will suggest a perspective on 
linguistic events that an event theory might take. 

2. Perception of Speech Events: A Local Perspective 

There is a general paradigm that all instances of perception appear to 
fit. Perception requires events in the environment ("distal events"), one 
or more "informational media" -- that is, sources of information about distal 
events in energy media that can stimulate the sense organs -- and a perceiver. 
As already noted, objects and occurrences in the environment are generally 
capable of multiple descriptions. Those that are relevant to a perceiver 
refer to "distal events." They have "affordances"-- that is, sets of 
possibilities for interaction with them by the perceiver. (Affordances are 
"what [things] furnish, for good or ill" [Gibson, 1967/1982; see also, Gibson, 
1979]). An informational medium, including reflected light, acoustic signals, 
and the perceiver's own skin, acquires structure from an environmental event 
specific to certain properties of the event; because it acquires structure in 
this way, the medium can provide information about the event properties to a 
sensitive perceiver. A second crucial characteristic of an informational 
medium is that it can convey its information to perceivers by stimulating 
their sense organs and imparting some of its structure to them. By virtue of 
these two characteristics, informational media enable direct perception of 
environmental events. The final ingredient in the paradigm is a perceiver who 
actively seeks out information relevant to his or her current needs or 
concerns. Perceivers are active in two senses. They move around in the 
environment to intercept relevant sources of information. In addition, in 
ways not yet well understood, they "attune" their perceptual systems (e.g., 
Gibson, 1966/1982) to attend selectively to different aspects of available 
environmental structure. 

In speech perception, the distal event considered locally is the 
articulating vocal tract. How it is best described to reflect its 
psychologically significant properties is a problem for investigators of 
speech perception as well as of speech production. However, I will only 
characterize articulation in general terms here, leaving its more precise 
description to Kelso, Saltzman and Tuller (1986/this volume) in their 
presentation. One thing we do know is that phonetic segments are realized as 
coordinated gestures of vocal-tract structures -- that is, as coupled 
relationships among structures that jointly realize the segments (e.g., Kelso, 
Tuller, Vatikiotis-Bateson, & Fowler, 1984). Therefore, studies of the 
activities of individual muscles or even individual articulators will not 
reveal the systems that constitute articulated phonetic segments. 

The acoustic speech signal has the characteristics of an informational 
medium. It acquires structure from the activities of the vocal tract and it 
can impart its structure to an auditory perceptual system, thereby conveying 
its information to a sensitive perceiver. In this way, it enables direct 
perception of the environmental source of its structure, the activities of the 
vocal tract. Having perceived an utterance, a listener has perceived the 
various "affordances" of the conversational event and can guide his or her 
subsequent activities accordingly. 

This, in outline form, is a theory of the direct perception of speech 
events. The theory promotes a research program having four parts, three 
relating to the conditions supporting direct perception of speech events and 
the last relating to the work that speech events do in the environment. To 
assess the claim that speech events are directly perceived, the articulatory 
realizations of phonetic segments must be uncovered and their acoustic 
consequences identified. Next, the listener's sensitivity to, and use of, the 
acoustic information must be pinned down. Finally, the listener's use of the 
structure in guiding his or her activities must be studied. Although, of 
course, a great deal of research has been done on articulation and perception 
of speech, very little has been conducted from the theoretical perspective of 
an event theory and very little falls within the research program just 
outlined. 

Indeed, my impression, based on publishing investigations of speech 
conducted from this perspective and on presentations of the theoretical 
perspective to other speech researchers, is that it has substantial face 
invalidity. There are several things seemingly true of speech production and 
perception that, in the view of many speech researchers, preclude development 
of a theory of direct perception of speech events. I will consider four 
barriers to the theory, along with some suggestions concerning ways to 
surmount or circumvent them. 

2.1 The First Barrier: If Listeners Recover Articulation, Why Don't They Know 
It? 

A claim that perceivers see environmental events rather than the optic 
array that stimulates their visual systems seems far less radical than a claim 
that they hear phonetically-structured articulatory gestures rather than the 
acoustic speech signal. Indeed, when Repp (1981) makes the argument that 
phonetic segments are "abstractions" and products of cognitive processes 
applied to stimulation, he says of them that "they have no physical 
properties -- such as duration, spectrum and amplitude -- and, therefore, cannot 
be measured" (p. 1463, italics in the original). That is, he assumes that if 
phonetic segments were to have physical properties, the properties would be 
acoustic. Yet no one thinks that, if the objects of visual perception -- that 
is, trees, tables, people, etc. -- do have physical properties, their properties 
are those of reflected light. 

Somewhat compatibly, our phenomenal experience when we hear speech 
certainly is not of lips closing, jaws raising, velums lowering, and so on, 
although our visual experience is of the objects and events in the world. Of 
course, we do not experience surface features of the acoustic signal 
either — that is, silent gaps followed by stop-bursts, or formant patterns, or 
nasal resonances. 



I cannot explain the failure of our intuitions in speech to recognize 
that perceived phonetic events are articulatory, as compared to our intuitions 
about vision, which do recognize that perceived events are environmental, but 
I can think of a circumstance that exacerbates the failure among researchers. 
If, in an experimental study, listeners do indeed recover articulatory events 
in perception, there is likely to be a large mismatch between the level of 
description of an articulatory event that they recover in an experimental 
study and a researcher's description of the activities of the individual 
articulators. That is, speech researchers do not yet know what articulatory 
events consist of. If a perceiver does not experience "lips closing," for 
example, that is as it should be, because lip closure per se is not an 
articulatory speech event. Rather (see the contribution by Kelso et al.), an 
articulatory event that is a phonetic event, for example, is a coordinated set 
of movements by vocal tract structures. By hypothesis, the percept [b] 
corresponds to extraction from the acoustic speech signal of information that 
the appropriate coordinated gestures occurred in the talker's vocal 
tract -- just as the perceptual experience of a zooming baseball corresponds 
to extraction of information from the optic array that the event of zooming 
occurred in the environment. 



The literature offers evidence from a wide variety of sources that 
listeners do extract information about articulation from the acoustic speech 
signal. Much of this evidence has recently been reviewed by Liberman and 
Mattingly (1985) in support of a motor theory. I will select just a few 
examples. 

1. Perceptual equivalence of distinct acoustic "cues" specifying the 
same articulatory event. In nonphonetic contexts, silence produces a very 
different perceptual experience from a set of formant transitions. However, 
interposed between frication for an [s] and a syllable sounding like [lit] in 
isolation, they may not (Fitch, Halwes, Erickson, & Liberman, 1980). An 
appropriate interval of silence may foster perception of [p]; so may a lesser 
amount of silence, insufficient to cue a [p] percept in itself, followed by 
transitions characteristic of [p] release. Strikingly, a pair of syllables 
differing both in the duration of silence after the [s] frication and in 
presence or absence of [p] transitions following the silence are either highly 
discriminable -- and more discriminable than a pair of syllables differing along 
just one of these dimensions -- or nearly indiscriminable -- and less 
discriminable than a pair differing in just one dimension -- depending on 
whether the silence and transitions "cooperate" or "conflict." They cooperate 
if, within one syllable, both acoustic segments provide evidence for stop 
production and, within the other, they do not. They conflict if the syllable 
having a relatively long interval of silence appropriate to stop closure lacks 
the formant transitions characteristic of stop release, while the syllable 
with a short interval of silence has transitions. Depending on the durations 
of silence, these latter syllables may both sound like "split" or both like 
"slit." 

The important point is that very different acoustic properties sound 
similar or the same just when the information they convey about articulation 
is similar or the same. It should follow, and does, that when an articulation 
causes a variety of acoustic effects (for example, Lisker [1978] has 
identified more than a dozen distinctions between voiced and voiceless stops 
intervocalically), the acoustic consequences individually tend to be 
sufficient to give rise to the appropriate perception but none are necessary. 
(See Liberman & Mattingly, 1985, for a review of those findings.) 

2. Different perceptual experiences of the same acoustic segment just 
when it specifies different distal sources. By the same token, the same 
acoustic segment in different contexts, where it specifies different 
articulations or none at all, sounds quite different to perceivers. In the 
experiment by Fitch et al. just described, a set of transitions characteristic 
of release of a bilabial stop will only give rise to a stop percept in that 
context if it is preceded by sufficient silence. This cannot be 
because, in the absence of silence, the [s] frication masks the transitions; 
other research demonstrates that transitions at fricative release themselves 
do contribute to fricative place perception (e.g., Harris, 1958; Whalen, 
1981). Rather, it seems, release can only be perceived in this context given 
sufficient evidence for prior stop closure. Similarly, if transitions are 
presented in isolation where, of course, they do not signal stop release, or 
even production by a vocal tract at all, they sound more-or-less the way that 
they look on a visual display — that is, like frequency rises and falls (e.g., 
Mattingly, Liberman, Syrdal, & Halwes, 1971). 

3. "P center." Spoken digits (Morton, Marcus, & Frankish, 1976) or 
nonsense monosyllables (Fowler, 1979), aligned so that their onsets of 
acoustic energy are isochronous, do not sound isochronous to listeners. Asked 
to adjust the timing of pairs of digits (Marcus, 1981) or monosyllables 
(Cooper, Whalen, & Fowler, 1984) produced repeatedly in alternation so that 
they sound isochronous, listeners introduce systematic departures from 
measured isochrony -- just those that talkers introduce if they produce the same 
utterances to a real (Fowler & Tassinary, 1981; Rapp, 1971) or imaginary 
(Fowler, 1979; Tuller & Fowler, 1980) metronome. Measures of muscular 
activity supporting the talkers' articulations are isochronous in rhyming 
monosyllables produced to an imaginary metronome. Thus, talkers follow 
instructions to produce isochronous sequences, but due (in large part) to the 
different times after articulatory onset that different phonetic segments have 
their onsets of acoustic energy, acoustic measurements of their productions 
suggest a failure of isochrony. For their part, listeners appear to hear 
through the speech signal to the timing of the articulations. 

4. Lip reading. Liberman and Mattingly (1985) describe a study in which 
an acoustic signal for a production of [ba] synchronized to a face mouthing 
[be], [ve] and [ðe] may be heard as [ba], [va] and [ða], respectively 
(cf. McGurk & MacDonald, 1976). Listeners experience hearing syllables with 
properties that are composites of what is seen and heard, and they have no 
sense that place information is acquired largely visually and vowel 
information auditorily. (This is reminiscent of the quotation from Hornbostel 
[1927] reprinted in Gibson [1966]: "it matters little through which sense I 
realize that in the dark I have blundered into a pigsty." Likewise, it seems, 
it matters little through what sense we realize what speech event has 
occurred.) Within limits anyway, information about articulation gives rise to 
an experience of hearing speech, whether the information is in the optic array 
or in the acoustic signal. 
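
The timing argument in the "P center" example above can be made concrete with a small arithmetic sketch. It is an illustration only: the 500 ms period, the two syllable labels, and the lag values are hypothetical and are not measurements from the studies cited. The sketch assumes that a talker starts each articulation exactly on the beat and that each syllable type has its own fixed delay between articulatory onset and the onset of acoustic energy.

```python
def acoustic_onsets(articulatory_onsets_ms, lags_ms):
    """Shift each articulatory onset by that syllable's lag to acoustic energy."""
    return [onset + lag for onset, lag in zip(articulatory_onsets_ms, lags_ms)]


def intervals(times_ms):
    """Successive differences between onset times."""
    return [later - earlier for earlier, later in zip(times_ms, times_ms[1:])]


# Hypothetical values, chosen only for illustration.
PERIOD_MS = 500                                 # metronome period the talker follows
LAG_MS = {"syllable_A": 10, "syllable_B": 70}   # articulatory onset -> acoustic onset

sequence = ["syllable_A", "syllable_B"] * 3
articulatory = [i * PERIOD_MS for i in range(len(sequence))]
acoustic = acoustic_onsets(articulatory, [LAG_MS[s] for s in sequence])

articulatory_intervals = intervals(articulatory)
acoustic_intervals = intervals(acoustic)

print(articulatory_intervals)  # [500, 500, 500, 500, 500]
print(acoustic_intervals)      # [560, 440, 560, 440, 560]
```

Under these assumptions the articulations are perfectly isochronous while the acoustic onsets alternate between long and short intervals, which is the pattern of "failure" that acoustic measurement would report.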

2.2 The Second Barrier: Linguistic Units Are Not Literally Articulated 

A theory of perception of speech events is disconfirmed if the linguistic 
constituents of communications between talkers and listeners do not make 
public appearances. There are two kinds of reason for doubting that they do, 
both relating to an incommensurability that many theorists and researchers 
have identified between knowing and doing, between competence and performance, 
or even between the mental and the physical realizations of language. 

One kind of incommensurability is graphically illustrated by Hockett's 
Easter egg analogy (Hockett, 1955). According to the analogy, articulation, 
and, in particular, the coarticulation that inertial and other physical 
properties of the vocal tract require, obliterates the discrete, context-free 
phonetic segments of the talker's planned linguistic message. Hockett 
suggests that the articulation of planned phonetic segments is analogous to 
the effects that a wringer would have on an array of (raw) Easter eggs. If 
the analogy is apt, and listeners nonetheless can recover the phonetic 
segments of the talker's plan, then direct detection of articulatory gestures 
in perception cannot fully explain perception, because the gestures themselves 
provide a distorted representation of the segments. To explain recovery of 
phonetic segments from the necessarily impoverished information in the 
acoustic signal, reconstructive processes or other processes involving 
cognitive mediation (Hammarberg, 1976, 1982; Hockett, 1955; Neisser, 1967; 
Repp, 1981) or noncognitive mediation (Liberman & Mattingly, 1985) must be 
invoked. 

Hockett is not the only theorist to propose that ideal phonetic segments 
are distorted by the vocal tract. For example, MacNeilage and Ladefoged 
describe planned segments as discrete, static, and context-free, whereas 
uttered segments are overlapped, dynamic, and context-sensitive. 

A related view expressed by several researchers is that linguistic units 
are mental things that, thereby, cannot be identified with any set of 
articulatory or acoustic characteristics. For example: 

[Phonetic segments] are abstractions. They are the end result of 
complex perceptual and cognitive processes in the listener's brain. 
(Repp, 1981, p. 1462) 

They [phonetic categories] have no physical properties. (Repp, 
1981, p. 1463) 

Segments cannot be objectively observed to exist in the speech 
signal nor in the flow of articulatory movements... [T]he concept of 
segment is brought to bear a priori on the study of 
physical-physiological aspects of language. (Hammarberg, 1976, 
p. 355) 

[T]he segment is internally generated, the creature of some kind of 
perceptual-cognitive process. (Hammarberg, 1976, p. 355) 

This point of view, of course, requires a mentalist theory of perception. 

For a realist event theory to be possible, what modifications to these 
views are required? The essential modification is to our conceptualization of 
the relation between knowing and doing. First, phonetic segments as we know 
them can have only properties that can be realized in articulation. Indeed, 
from an event perspective, the primary reality of the phonetic segment is its 
public realization as vocal-tract activity. What we know of the segments, we 
know from hearing them produced by other talkers or by producing them 
ourselves. Second, the idea that speech production involves a translation 
from a mental domain into a physical, nonmental domain such as the vocal tract 
must be discarded. 

With respect to the first point, we can avoid the metaphor of Hockett's 
wringer if we can avoid somehow ascribing properties to phonetic segments that 
vocal tracts cannot realize. In view of the fact that phonetic segments 
evolved to be spoken, and indeed, that we have evolved to speak them 
(Lieberman, 1982), this does not seem to be a radical endeavor. 

Vocal tracts cannot produce a string of static shapes, so for an event 
theory to be possible, phonetic segments cannot be inherently static. 
Likewise, vocal tracts cannot produce the segments discretely, if discrete 
means "nonoverlapping." However, neither of these properties is crucial to the 
work that phonetic segments do in a linguistic communication and therefore can 
be abandoned without loss. 

Phonetic segments do need to be separate one from the other and serially 
ordered, however, and Hockett's Easter egg analogy suggests that they are not. 
My own reading of the literature on coarticulation, however, is that the 
Easter egg analogy is misleading and wrong. Figure 1 is a redrawing of a 
figure from Carney and Moll (1971). It is an outline drawing of the vocal 
tract with three tongue shapes superimposed. The shapes were obtained by 
cinefluorography at three points in time during the production of the disyllable 
[husi]. The solid line reflects the tongue shape during a central portion of 
the vowel [u]; the dashed line is the tongue shape during closure for [s]; the 
x-ed line is the tongue shape during a central portion of [i]. Thus, the 
figure shows a smooth vowel-to-vowel gesture of the tongue body taking place 
during closure of [s] (cf. Öhman, 1966). The picture these data reveal is 
much cleaner than the Easter egg metaphor would suggest. Gestures for 
different segments overlap, but the separation and ordering of the segments are 
preserved. 

Figure 1: Cinefluorographic tracing of the vocal tract during three phases in 
production of /husi/ (redrawn from Carney & Moll, 1971). Legend: /u/ in /husi/ 
(solid line), /s/ in /husi/ (dashed line), /i/ in /husi/ (x-ed line). 

With respect to the second point, Ryle (1949) offers a way of 
conceptualizing the relation between the mental and the physical that avoids 
the problems consequent upon identifying the mental with covert processes 
taking place inside the head: 

When we describe people as exercising qualities of mind, we are not 
referring to occult episodes of which their overt acts and 
utterances are effects, we are referring to those overt acts and 
utterances themselves. (p. 25) 

When a person talks sense aloud, ties knots, feints or sculpts, the 
actions which we witness are themselves the things which he is 
intelligently doing... He is bodily active and mentally active, but 
he is not being synchronously active in two different "places," or 
with two different "engines." There is one activity, but it is 
susceptible of and requiring more than one kind of explanatory 
description. (pp. 50-51) 



This way of characterizing intelligent action does not eliminate the 
requirement that linguistic utterances must be planned. Rather it eliminates 
the idea that covert processes are privileged in being mental or 
psychological, whereas overt actions are not. Instead, we may think of the 
talker's intended message as it is planned, uttered, specified acoustically, 
and perceived as being replicated intact across different physical media from 
the body of the talker to that of the listener. 

An event theory of speech production must aim to characterize 
articulation of phonetic segments as overlapping sets of coordinated gestures, 
where each set of coordinated gestures conforms to a phonetic segment. By 
hypothesis, the organization of the vocal tract to produce a phonetic segment 
is invariant over variation in segmental and suprasegmental contexts. The 
segment may be realized somewhat differently in different contexts (for 
example, the relative contributions of the jaw and lips may vary over 
different bilabial closures [Sussman, MacNeilage, & Hanson, 1973]), because of 
competing demands on the articulators made by phonetic segments realized in an 
overlapping time frame. To the extent that a description of speech production 
along these lines can be worked out, the possibility remains that phonetic 
segments are literally uttered and therefore are available to be directly 
perceived if the acoustic signal is sufficiently informative. Research on a 
"task dynamic" model of speech production (e.g., Kelso et al., 1986/this 
volume; Saltzman, in press; Saltzman & Kelso, 1983) may provide at the very 
least an existence proof that systems capable of realizing overlapping 
phonetic segments nondestructively can be devised. 

2.3 The Third Barrier: The Acoustic Signal Does Not Specify Phonetic Segments 

Putting aside the question whether phonetic segments are realized 
nondestructively in articulation, there remains the problem that the acoustic 
signal does not seem to reflect the phonetic segmental structure of a 
linguistic communication. It need not, even if phonetic segments are uttered 
intact. Although gestures of the vocal tract cause disturbances in the air, 
it need not follow that the disturbances specify their causes. For many 
researchers, they do not. Figure 2 (from Fant & Lindblom [1961] and Cutting & 
Pisoni [1978]) displays the problem. 

A spectrographic display of a speech utterance invites segmentation into 
"acoustic segments" (Fant, 1973). Visibly defined, these are relatively 
homogeneous intervals in the display. Segmentation lines are drawn where 
abrupt changes are noticeable. The difficulty with this segmentation is the 
relation it bears to the component phonetic segments of the linguistic 
utterance. In the display, the utterance is the name, "Santa Claus," which is 
composed of nine phonetic segments, but 18 acoustic segments. The relation of 
phonetic segments to acoustic segments is not simple, as the bottom of Figure 2 
reveals. Phonetic segments may be composed of any number of acoustic 
segments, from two to six in the figure, and most acoustic segments reflect 
properties of more than one phonetic segment. 

How do listeners recover phonetic structure from such a signal? One 
thing is clear: the functional parsing of the acoustic signal for the 
perceiver is not one into acoustic segments. Does it follow that perceivers 
impose their own parsing on the signal? There must be a "no" answer to this 
question for an event theory devised from a direct-realist perspective to be 
viable. The perceived parsing must be in the signal; the special role of the 
perceptual system is not to create it, but only to select it. 

The first point to be made in this regard is that there is more than one 
physical description of the acoustic speech signal. A spectrographic display 
suggests a parsing into acoustic segments, but other displays suggest other 
parsings of the signal. For example, Kewley-Port (1983) points out that in a 
spectrographic display the release burst of a syllable-initial stop consonant 
looks quite distinct from the formant transitions that follow it (for example, 
see the partitioning of /k/ in "Claus" in Figure 2). Indeed, research using 
the spectrographic display as a guide has manipulated burst and transition to 
study their relative salience as information for stop place (e.g., Dorman, 
Studdert-Kennedy, & Raphael, 1977). However, Kewley-Port's "running spectra" 
for stops (overlapping spectra from 20 ms windows taken at successive five ms 
intervals following stop release) reveal continuity between burst and 
transitions in changes in the location of spectral peaks from burst to 
transition. 
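
The idea of a running spectrum can be made concrete with a minimal sketch. It assumes only what the text states (20 ms windows advanced in 5 ms steps from stop release); the Hamming taper, the number of frames, the use of a plain FFT, and the function names `running_spectra` and `peak_frequencies` are illustrative assumptions, not Kewley-Port's actual analysis procedure.

```python
import numpy as np

def running_spectra(signal, sample_rate, release_index,
                    window_ms=20.0, step_ms=5.0, n_frames=8):
    """Spectra of overlapping windows that start at stop release.

    Returns (times_ms, freqs_hz, spectra_db); each row of spectra_db is one window.
    The taper and frame count are assumptions made for this sketch.
    """
    win = int(round(sample_rate * window_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    taper = np.hamming(win)
    freqs = np.fft.rfftfreq(win, d=1.0 / sample_rate)
    times, spectra = [], []
    for k in range(n_frames):
        start = release_index + k * step
        frame = signal[start:start + win]
        if len(frame) < win:
            break  # not enough samples left for a full window
        magnitude = np.abs(np.fft.rfft(frame * taper))
        spectra.append(20.0 * np.log10(magnitude + 1e-12))  # dB, guarded against log(0)
        times.append(k * step_ms)
    return np.array(times), freqs, np.array(spectra)

def peak_frequencies(freqs, spectra_db):
    """Frequency of the strongest spectral peak in each window."""
    return freqs[np.argmax(spectra_db, axis=1)]
```

Because successive windows share three quarters of their samples, the location of the strongest peak can be followed from window to window, which is the sense in which such a display can show continuity where a spectrogram suggests a boundary.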



It does not follow, then, from the mismatch between acoustic segment and 
phonetic segment, that there is a mismatch between the information in the 
acoustic signal and the phonetic segments in the talker's message. Possibly, 
in a manner as yet undiscovered by researchers but accessed by perceivers, the 
signal is transparent to phonetic segments. 

If it is, two research strategies should provide converging evidence 
concerning the psychologically relevant description of the acoustic signal. 
The first seeks a description of the articulatory event itself (that is, of 
sequences of phonetic segments as articulated) and then investigates the 
acoustic consequences of the essential articulatory components of phonetic 
segments. A second examines the parsing of the acoustic signal that listeners 
detect. 

Figure 2: (a) Spectrographic display of "Santa Claus"; (b) Schematic display 
of the relationship between acoustic and phonetic segments 
(reprinted with permission from Cutting & Pisoni, 1978, and Fant & 
Lindblom, 1961). 

The research that comes closest to this characterization is that of 
Stevens and Blumstein (1978, 1981; Blumstein & Stevens, 1979, 1981). They 
begin with a characterization of phonetic segments and, based on the acoustic 
theory of speech production (Fant, 1960), develop hypotheses concerning 
invariant acoustic consequences of essential articulatory properties of the 
segments. They then test whether the consequences are, in fact, invariant 
over talkers and phonetic-segmental contexts. Finally, they ask whether these 
consequences are used by perceivers. 

Unfortunately for the purposes of an event approach, perhaps, they begin 
with a characterization of phonetic segments as bundles of distinctive 
features. This characterization differs in significant ways from one that 
will be developed from a perspective on phonetic segments as coordinated 
articulatory gestures (see, for example, Browman and Goldstein, in press). 
One important difference is that the features tend to be static; accordingly, 
the acoustic consequences first sought in the research program were static 
also. A related difference is that the characterization deals with 
coarticulation by presuming that the listener gets around it by focusing his 
or her attention on the least coarticulated parts of the signal. As I will 
suggest shortly, that does not conform with the evidence; nor would it be 
desirable, because acoustic consequences of coarticulated speech are quite 
informative (cf. Elman & McClelland, 1983). 

To date, Stevens and Blumstein have focused most of their attention on 
invariant information for consonantal place of articulation. Their hypotheses 
concerning possible invariants are based on predictions derived from the 
acoustic theory of speech production concerning acoustic correlates of 
constrictions in various parts of the vocal tract. As Stevens and Blumstein 
(1981) observe, when articulators adopt a configuration, the vocal tract forms 
cavities that have natural resonances, the formants. Formants create spectral 
peaks in an acoustic signal — that is, a range of frequencies higher in 
intensity than their neighbors. A constriction in the vocal tract affects the 
resonance frequencies and intensities of the formants. Thus, stop consonants 
with different places of articulation should have characteristic burst spectra 
independent of the vowel following the consonant and independent of the size 
of the vocal tract producing the constriction. 

Blumstein and Stevens (1979) created "template" spectra for the stop 
consonants /b/, /d/, and /g/, and then attempted to use them to classify the 
stops in 1800 CV and VC syllables in which the consonants were produced by 
different talkers in the context of various vowels. Overall, they were 
successful in classifying syllable-initial stops, but less successful with 
final stops, particularly if the stops were unreleased. Blumstein and Stevens 
(1980) also showed that listeners could classify stops by place better than 
chance when they were given only the first 10-46 ms of CV syllables. 

However, two investigations have shown that the shape of the spectrum at 
stop release is not an important source of information for stop place. Those 
studies (Blumstein, Isaacs, & Mertus, 1982; Walley & Carrell, 1983) pitted 
place information contributed by the shape of the spectrum at stop release in 
CVs, against the (context-dependent) information for place contributed by the 
formant frequencies themselves. In both studies, the formants overrode the 
effect of spectral shape in listeners' judgments of place. 

Recently, Lahiri, Gewirth, and Blumstein (1984) have found in any case 
that spectral shape does not properly classify labial, dental, and alveolar 
stops produced by speakers of three different languages. In search of new 
invariants and following the lead of Kewley-Port (1983), they examined the 
information in running spectra. They found that they could classify stops 
according to place by examining relative shifts in energy at high and low 
frequencies from burst to voicing onset. Importantly, pitting the appropriate 
running spectral patterns against formant frequencies for 10 CV syllables in a 
perceptual study, Lahiri et al. found that the spectral information was 
overriding. The investigators identify their proposed invariants as 
"dynamic," because they are revealed over time during stop release, and 
relational because they are based on relative changes in the distribution of 
energy at high and low frequencies in the vicinity of stop release. 
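
What makes a measure of this kind relational can be illustrated with a minimal sketch. It is not Lahiri et al.'s actual metric: the band edges, the choice of one frame at the burst and one at voicing onset, and the function names `band_energy_db` and `relative_energy_shift` are all assumptions chosen for illustration.

```python
import numpy as np

def band_energy_db(frame, sample_rate, lo_hz, hi_hz):
    """Energy (dB) of a Hamming-windowed frame within [lo_hz, hi_hz)."""
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    band = (freqs >= lo_hz) & (freqs < hi_hz)
    return 10.0 * np.log10(power[band].sum() + 1e-12)

def relative_energy_shift(burst_frame, voicing_frame, sample_rate,
                          low_band=(1000.0, 2000.0),
                          high_band=(3000.0, 4000.0)):
    """Change in (high-band minus low-band) energy from burst to voicing onset.

    The band edges are hypothetical placeholders, not published values.
    A negative result means energy moved toward low frequencies over time.
    """
    def tilt(frame):
        return (band_energy_db(frame, sample_rate, *high_band)
                - band_energy_db(frame, sample_rate, *low_band))
    return tilt(voicing_frame) - tilt(burst_frame)
```

The quantity returned depends only on how the balance of energy changes between the two frames, not on the absolute level of either one, so uniformly scaling the whole signal leaves it unchanged.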

Lahiri et al. are cautious whether their proposed invariants will 
withstand further test, and properly so, because the invariants are somewhat 
contrived in their precise specification. I suspect that major advances in 
the discovery of invariant acoustic information for phonetic segments will 
follow advances in understanding how phonetic segments are articulated. 
However, the proposals of Lahiri et al. (see also Kewley-Port, 1983) 
constitute an advance over the concept of spectral shape in beginning to 
characterize invariant acoustic information for gestures rather than for 
static configurations. 

2.4 The Fourth Barrier: Perception Demonstrably Involves "Top Down" Processes 
and Perceivers Do Make Mistakes 

Listeners may "restore" missing phonetic segments in words (Samuel, 1981; 
Warren, 1970), and talkers shadowing someone else's speech may "fluently 
restore" mispronounced words to their correct forms (e.g., Marslen-Wilson & 
Welsh, 1978). Even grosser departures of perceptual experience from 
stimulation may be observed in some mishearings (for example, "popping really 
slow" heard as "prodigal son" [Browman, 1980] or "mow his own lawn" heard as 
"blow his own horn" [Garnes & Bond, 1980]). 

These kinds of findings are often described as evidence for an 
interaction of "bottom up" and "top down" processes in perception (e.g., 
Klatt, 1980). Bottom-up processes analyze stimulation as it comes in. 
Top-down processes draw inferences concerning stimulation based both on the 
fragmentary results of the ongoing bottom-up processes and on stored knowledge 
of likely inputs. Top-down processes can restore missing phonemes or correct 
erroneous ones in real words by comparing results of bottom-up processes 
against lexical entries. As for mishearings, Garnes and Bond (1980) argue 
that "active hypothesizing on the part of the listener concerning the intended 
message is certainly part of the speech perception process. No other 
explanation is possible for misperceptions which quite radically restructure 
the message..." (p. 238) 

In my view (but not necessarily in the view of other event theorists), 
these data do offer a strong challenge to an event theory. It is not that 
an event theory of speech perception has nothing to say about perceptual 
learning (for example, Gibson, 1966; Johnston & Pietrewicz, 1985). However, 
what is said is not yet well enough worked out to specify how, for example, 
lexical knowledge can be brought to bear on speech input from a 
direct-realist, event perspective. 

With regard to mishearings, there is also a point of view (Shaw, Turvey, 
& Mace, 1982) that when reports of environmental events are in error, the 
reporter cannot be said to have perceived the events, because the word 
"perception" is reserved for just those occasions when acquisition of 
information from stimulation is direct and, therefore, successful. The 
disagreement with theories of perception as indirect and constructive, then, 
may reduce to a disagreement concerning how frequently bottom-up processes 
complete their work in the absence of top-down influence. 

I prefer an approach similar to that of Shaw et al., one that makes a 
distinction between what perceivers can do and what they may do in particular 
settings. As Shaw et al. argue, there is a need for the informational support 
for activity to be able to be directly extracted from an informational medium 
and for perception to be nothing other than direct extraction of information 
from proximal stimulation. However, in familiar environments, actors may 
generally guide their activities based not only on what they perceive, but 
also on what the environment routinely affords. In his presentation at the 
first event conference, Jenkins (1985) reviews evidence that the bat's 
guidance of flying sometimes takes this form. Placed in a room with barriers 
that must be negotiated to reach a food source, the bat soon learns the route 
(Griffin, 1958). After some time in which the room layout remains unchanged, 
a barrier is placed in the bat's usual flight path. Under these novel 
conditions, the bat is likely to collide with the barrier. Although it could 
have detected the barrier, it did not. By the same token, as a rule, we 
humans do not test a sidewalk to ensure that it will bear our weight before 
entrusting our weight to it. Nor do we walk through (apparent) apertures with 
our arms outstretched just in case the aperture does not really afford passage 
because someone has erected a difficult-to-see plate-glass barrier. In short, 
although the affordances that guide action can be directly perceived, often 
they are not wholly. We perceive enough to narrow down the possible 
environments to one likely environment that affords our intended activity and 
other remotely likely ones that may not. 

Perceptual restorations and mishearings imply the same perceptual 
pragmatism among perceivers of speech. It is also implied, I think, by 
talkers' tendencies to adjust the formality of their speaking style to their 
audience (e.g., Labov, 1972). Audiences with whom the talker shares 
substantial past experiences may require less information to get the message 
than listeners who share less. Knowing that, talkers conserve effort by 
providing less where possible. 

It may be important to emphasize that the foregoing attempt to surmount 
the fourth barrier is intended to do more than translate a description of 
top-down and bottom-up processes into a terminology more palatable to event 
theorists. In addition, I am attempting to allow a role for information not 
currently in stimulation to guide activity while preserving the ideas that 
perception itself must be direct and hence, errorless, and that activity can 
be (but often is not) guided exclusively by perceived affordances. 

As to the latter idea, the occurrence of mishearings that depart 
substantially from the spoken utterance should not deflect our attention from 
the observation that perceivers can hew the talker's articulatory line very 
closely if encouraged to do so. One example from my own research is provided 
by investigations of listeners' perceived segmentation of speech. Figure 1 
above, already described, displays coarticulation of the primary articulators 
for vowels and consonants produced in a disyllable. This overlap has two 
general consequences in the acoustic signal (one generally acknowledged as a 
consequence, the other not). First, within a time frame generally identified 
with one phonetic segment (because the segment's acoustic consequences are 
dominant), the acoustic signal is affected by the segment's preceding and 
following neighbors. Second, because the articulatory trajectories for 
consonants overlap part of the trajectory of a neighboring vowel (cf. Carney & 
Moll, 1971; Ohman, 1966), the extent of time in the acoustic signal during 
which the vowel predominates in its effects — and hence the vowel's measured 
duration — decreases in the context of many consonants or of long consonants as 
compared to its extent in the context of few or short consonants (Fowler, 
1983; Fowler & Tassinary, 1981; Lindblom & Rapp, 1973). 

Listeners can exhibit sensitivity to the information for the overlapping 
phonetic segments that talkers produce in certain experimental tasks. In 
these tasks, the listeners use acoustic information for a vowel within a 
domain identified with a preceding consonant (for example, within a stop burst 
or within frication for a fricative consonant) as information for the vowel 
(Fowler, 1984; Whalen, 1984). Moreover, listeners do not integrate the 
overlapping information for vowel and consonant. Rather, they hear the 
consonant as if the vowel information had been factored from it (Fowler, 1985) 
and they hear the vowel as longer than its measured extent by an amount 
correlated with the extent to which a preceding consonant should have 
shortened it by overlapping its leading edge (Fowler & Tassinary, 1981). 

These studies indicate that listeners can track the talker's vocal tract 
activities very closely and, more specifically, that they extract a 
segmentation of the signal into the overlapping phonetic segments that talkers 
produce, not into discrete approximations to phonetic segments and not into 
acoustic segments. Of course, this is as it must be among young perceivers if 
they are to learn to talk based on hearing the speech of others. But whether 
or not a skilled listener will track articulation this closely in any given 
circumstance may depend on the extent to which the listener estimates that he 
or she needs to in order to recover the talker's linguistic message. 

3. Perception of Speech Events in an Expanded Time Frame: Sound Change 

Two remarkable facts about the bottom tier of dually structured language 
are that its structure undergoes systematic change over time and that the 
sound inventories and phonetic processes of language reflect the articulatory 
dispositions of the vocal tract and perceptual dispositions of the ear 
(Lindblom et al., 1983; Locke, 1983; Ohala, 1981; Donegan & Stampe, 1979). 
There are many phonological processes special to individual languages that 
have analogues in articulatory-phonetic processes general to languages. For 
example, most languages have shorter vowels before voiceless than voiced stops 
(e.g., Chen, 1970). However, in addition, among languages with a phonological 
length distinction, in some (for example, German; see Comrie, 1980), 
synchronic or diachronic processes allow phonologically long vowels only 
before voiced consonants. Similarly, I have already described a general 
articulatory tendency for consonants to overlap vowels in production so that 
vowels are measured to shorten before clusters or long consonants more than 
before singleton consonants or short consonants. Compatibly, according to 
Elert (1964; cited in Lindblom & Rapp, 1973), in stressed syllables, Swedish 
short vowels appear only before long consonants or multiple consonants; long 
vowels appear before a short consonant or no consonant at all. In Yawelmani 
(see Kenstowicz & Kisseberth, 1979), a long vowel is made short before a 
cluster. Stressed vowels also are measured to shorten in the context of 
following unstressed syllables in many languages (Fowler, 1981; Lehiste, 1972; 
Lindblom & Rapp, 1973; Nooteboom & Cohen, 1975). Compatibly, in Chimwi:ni 
(Kenstowicz & Kisseberth, 1979), a long vowel may not, in general, occur 
before the antepenultimate syllable of a word. 

These are just a few examples involving duration that I have gathered, 
but similar examples abound as do examples of other kinds. We can ask: how 
do linguistic-phonological processes that resemble articulatory dispositions 
enter language? 

An interesting answer that Ohala (1981) offers to cover some cases is 
that they enter language via sound changes induced by systematic misperception 
by listeners. One example he provides is that of tonal development in "tone 
languages," including Chinese, Thai, and others. Tonal development on vowels 
may have been triggered by loss of a voicing distinction in preceding 
consonants. A consequence of consonant voicing is a rising tone on the 
following vowel (e.g., Hombert, 1979). Following a voiceless consonant, the 
tone is high and falling. Historical development of tones in Chinese may be 
explained as the listeners' systematic failure to ascribe the tone to 
consonant voicing — perhaps because the voicing distinction was weakening — and 
to hear it instead as an intentionally-produced characteristic of the vowel. 

This explanation is intriguing because, in relation to the perspective on 
perceived segmentation just outlined, it implies that listeners may sometimes 
recover a segmentation of speech that is not identical to the one articulated 
by the talker. In particular, it suggests that listeners may not always 
recognize coarticulatory encroachments as such and may instead integrate the 
coarticulatory influences with a phonetic segment with which they overlap in 
time. This may be especially likely when information for the occurrence of 
the coarticulating neighbor (or its relevant properties as in the case of 
voicing in Chinese) is weakening. However, Ohala describes some examples 
where coarticulatory information has been misparsed despite maintenance of the 
conditioning segment itself. Failures to recover the talker's segmental 
parsing may lead to sound change when listeners themselves begin producing the 
phonological segment as they recovered it rather than as the talkers produced 
it. 

Recent findings by Krakow, Beddor, Goldstein, and Fowler (1985) suggest 
that something like this may underlie an ongoing vowel shift in English. In 
English, the vowel /ae/ is raising in certain phonetic contexts (e.g., Labov, 
1981). One context is before a nasal consonant. Indeed, for many speakers of 
English the /ae/ in "can," for example, is a noticeably higher vowel than that 
in "cad." 

One hypothesis to explain the vowel shift in the context of nasal 
consonants is that listeners fail to parse the signal so that all of the 
influences of the nasalization on the vowel are identified with the 
coarticulatory influence of the nasal consonant. As Wright (1980) observes, 
the nasal formant in a nasalized vowel is lower in frequency than F1 of /ae/. 
Integrated with F1 of /ae/ or mistakenly identified as F1, the nasal formant 
is characteristic of a higher vowel (with a lower F1) than F1 of /ae/ itself. 

Krakow et al. examined this idea by synthesizing two kinds of continua 
using an articulatory synthesizer (Rubin, Baer, & Mermelstein, 1979). One 
continuum was a [bed] to [baed] series (henceforth, the bed-bad series) 
created by gradually lowering and backing the synthesizer's model tongue 
in seven steps. A second, [bend] to [baend], continuum 
(henceforth bend-band) was created in similar fashion, but with a lowered 
velum during the vowel and throughout part of the following alveolar 
occlusion. (In fact, several bend-band continua were synthesized with 
different degrees of velar lowering. I will report results on just one 
representative continuum.) Listeners identified the vowel in each series as 
spelled with "E" or "A." Figure 3a compares the responses to members of the 
bed-bad continuum with responses to a representative bend-band series. As 
expected, we found a tendency for subjects to report more "E"s in the 
bend-band series. 




Figure 3: Identification of vowels in the experiment of Krakow, Beddor, 
Goldstein, and Fowler (1985); see text for explanation. 

We reasoned that if this were due to a failure of listeners to parse the 
signal so that all of the acoustic consequences of nasality were ascribed to 
the nasal consonant, then by removing the nasal consonant itself, we would see 
as much or even more raising than in the context of a nasal consonant. 
Accordingly, we altered the original bed-bad series by lowering the model 
velum throughout the vowel. (I will call the new [bẽd]-[bæ̃d] continuum the 
bed(N)-bad(N) series. Again, different degrees of nasality were used over 
different continua. I will report data from a representative series.) Figure 
3b shows the results of this manipulation. Rather than experiencing increased 
raising, as expected, the listeners experienced significant lowering of the 
vowel in the bed(N)-bad(N) series. Although this outcome can be rationalized 
in terms of spectral changes to the oral formants of the vowel due to the 
influence of the nasal resonance on them, it does not elucidate the origin of 
the raising observed in the first study. 

A difference between our bend-band and bed-bad series was in the measured 
duration of the vowels. Following measurements of natural productions, we had 
synthesized syllables with shorter measured vowels in the bend-band series 
than in the bed-bad series. We next considered the possibility that this 
explained the raising we had found in the first experiment. /e/ is an 
"inherently" shorter vowel than /ae/ (e.g., Peterson & Lehiste, 1960). It 
seemed possible that raising in the bend-band series was not due to misparsing 
of nasality, but to misparsing of the vowel's articulated extent from that of 
the overlapping nasal consonant. In particular, the vowels in the bend-band 
continua might have been perceived as inherently shorter (rather than as more 
extensively overlapped by the syllable coda) than vowels in the bed-bad 
series, and hence as more /e/-like. 

To test that idea, we synthesized a new bend-band series with longer 
measured durations of vowels, matching those in the original bed-bad (and 
bed(N)-bad(N)) series, and new bed-bad and bed(N)-bad(N) series with vowels 
shortened to match the measured duration of those in the original bend-band 
series. Figures 4a and 4b show the outcome for the short and long series, 
respectively. Identification functions for bed-bad and bend-band are now 
identical. Listeners ascribe all of the nasality in the vowel to the 
consonant, and when vowels are matched in measured duration, there is no 
raising. Stimuli in the bed(N)-bad(N) series show lowering in both Figures 4a 
and 4b. 




Figure 4: Identification of vowels in continua having vowels matched in 
measured duration (data from Krakow et al., 1985). 

These results are of interest in several respects. For the present 
discussion, they are interesting in suggesting limitations in the extent to 
which these listeners could track articulation. Although listeners do parse 
speech along its coarticulatory lines in this study, ascribing the nasality 
during the vowel to the nasal consonant, they are not infinitely sensitive to 
parts of a vowel overlaid by a consonant. The difficulty they have detecting 
the trailing edges of a vowel may be particularly severe when the following 
consonants are nasals as in the present example, because, during a nasal, the 
oral cavity is sealed off and the acoustic signal mainly reflects passage of 
air through the nasal cavity. Consequently, information for the vowel is 
poor. (There is vowel information in nasal consonants, however, as Fujimura, 
1962, has shown.) 

In a study mentioned earlier, Fowler and Tassinary found that in a 
vowel-duration continuum in which voicing of a final alveolar stop was cued by 
vowel duration (cf. Raphael, 1972), the "voiceless" percept was resisted more 
for vowels preceded by consonants that, in natural productions, shorten their 
measured extents substantially than by consonants that shorten them less. In 
the study, however, the effect on the voicing boundary was less than the 
shortening effect of the preceding consonant would predict. Together, this 
study and that by Krakow et al. suggest that although listeners do parse the 
speech signal along coarticulatory lines, they do not always hear the vowels 
as extending throughout their whole coarticulatory extent. 

As Ohala has suggested (1981), these perceptual failures may provoke 
sound change. Thereby they may promote the introduction into the phonologies 
of languages of processes that resemble articulatory dispositions. 

What are the implications of this way of characterizing perception and 
sound change for the theory of perception of speech events? In the account, 
perceivers clearly are extracting affordances from the acoustic signal. That 
is, they are extracting information relevant to the guidance of their own 
articulatory activities. (See the following section for some other 
affordances perceived by listeners.) However, just as clearly, the distal 
event they reported in our experiment and that they reproduce in natural 
settings is not the one in the environment. The problem here may or may not 
be the same as that discussed as the "fourth barrier" above. In the present 
case, the problem concerns the salience of the information provided to the 
listener in relation to the listener's own sensitivity to it. Information for 
vowels where consonants overlap them presumably is difficult (but not 
impossible, see Fowler, 1984; Whalen, 1984) to detect. One way to handle the 
outcome of the experiment by Krakow et al. within a direct-realist event 
theory is to suppose that listeners extract less information from the signal 
than they need to report their percept in an experiment or to reproduce it 
themselves, and they fill in the rest of the information from experience at 
the time of report or reproduction. An alternative is that listeners are 
insensitive to the vowel information in the nasal consonant (either because it 
is not there or because they fail to detect it) and use that lack of 
information as information for the vowel's absence there. Presumably it is 
just the cases where important articulatory information is difficult to detect 
that undergo the perceptually-driven sound changes in languages (cf. Lindblom, 
1971). 

4. How Perception Guides Action 

Some Affordances of Phonetically-Structured Speech 

For those of us engaged in research on phonetic perception, it is easy to 
lose sight of the fact that, outside of the laboratory, the object of 
perceiving is not the achievement of a percept, but rather the acquisition of 
information relevant to guidance of activity. I will next consider how 
perception of phonetically-structured vocal activity may guide the listener's 
behavior. This is not, of course, where most of the action is to be found in 
speech perception. More salient is the way the perception of the linguistic 
message guides the listener's behavior. This is a very rich topic, but not 
one that I can cover here. 

Possibly, the most straightforward activity for a listener just having 
extracted information about how a talker controlled his or her articulators 
(but not, in general, the most appropriate activity), is to control one's own 
articulators in the same way — that is, to imitate. Indeed, research suggests 
that listeners can shadow speech with very short latencies (Chistovich, Klaas, 
& Kuzmin, 1962; Porter, 1977) and that their latencies are shorter to respond 
with the same syllable or one that shares gestures with it than with one that 
does not (Meyer & Gordon, 1984). 

Although this has been interpreted as relevant to an evaluation of the 
motor theory of speech perception (Liberman, Cooper, Shankweiler, & 
Studdert-Kennedy, 1967), it may also, or instead, reflect a more general 
disposition for listeners to mimic talkers (or perhaps to entrain to them). 
Research shows that individuals engaging in conversation move toward one 
another in speech rate (defined as number of syllables per unit time excluding 
pauses; Webb, 1972), in loudness (Black, 1949), and in average duration of 
pauses (Jaffe, 1964), although the temporal parameters of speaking also show 
substantial stability among individual talkers across a variety of 
conversational settings (Jaffe & Feldstein, 1970). In addition, Condon and 
Ogston (1971; also see Condon, 1976, for a review) report that listeners 
(including infants aged 1-4 days; Condon & Sander, 1974) move in synchrony 
with a talker's speech rhythms. 

Although it is possible that this disposition for "interactional 
synchrony" (Condon, 1976) has a function, for example, in signaling 
understanding, empathy, or interest on the listener's part (cf. Matarazzo, 
1965), the observations that some of the visible synchronies have been 
observed when the conversational partners cannot see one another, and some 
have been observed in infants, may suggest a more primitive origin. Condon 
(1976) suggests that interactional synchrony is a form of entrainment. 

The disposition to imitate among adults may be a carryover from infancy, 
when presumably it does have an important function (Studdert-Kennedy, 1983). 
Infants must extract information about phonetically-structured articulation 
from the acoustic speech signals of mature talkers in order to learn to 
regulate their own articulators to produce speech. Although it seems 
essential that infants do this, very little research does more than hint that 
infants have the capacity to imitate vocal productions. 

Infants do recognize the correspondence between visible articulation of 
others and an acoustic speech signal. They will look preferentially to the 
one of two videotapes on which a talker mouths a disyllable matching an 
accompanying acoustic signal (MacKain, Studdert-Kennedy, Spieker, & Stern, 
1983). Moreover, infants recognize the equivalence of their own facial 
gestures to those of someone else. That is, they imitate facial gestures, 
such as lip or tongue protrusion (Meltzoff & Moore, 1985) even though, as 
Meltzoff and Moore point out, such imitation is "intermodal" because the 
infants cannot see their own gestures. Together, these findings suggest that 
infants should be capable of vocal imitation. 

Relatively few studies have examined infants' imitation of adult 
vocalizations, however. Infants are responsive to mothers' vocalizations, and 
indeed, vocalize simultaneously with them to a greater-than-chance extent 
(Stern, Jaffe, Beebe, & Bennett, 1975). There are a few positive reports of 
vocal imitation (e.g., Kessen, Levine, & Wendrick, 1979; Kuhl & Meltzoff, 
1982; Tuaycharoen, 1978; Uzgiris, 1973). However, few of them have been 
conducted with the controls now recognized as required to distinguish chance 
correspondences from true imitations. 

Of course, imitative responses are not the only activities afforded by 
speech, even speech considered only as phonetically-structured activity of the 
vocal tract. A very exciting area of research in linguistics is on natural 
variation in speaking (e.g., Labov, 1966/1982, 1972, 1980). The research 
examines talkers in something close to the natural environments in which 
talking generally takes place. It is exciting because it reveals a remarkable 
sensitivity and responsiveness of language users to linguistically-, 
psychologically-, and socially-relevant aspects of conversational settings. 
Most of these aspects must be quite outside of the language users' awareness 
much of the time; yet they guide the talker's speech in quite subtle but 
observable ways. 

Labov and his colleagues find that an individual's speaking style varies 
with the conversational setting in response, among other things, to 
characteristics of the conversational partner, including, presumably, the 
partner's own speaking style. Accordingly, adjustments to speaking style are 
afforded by the speech of the conversational partner. 

An example of research done on dialectal affordances of the speech of 
other social groups is provided by Labov's early study of the dialects of 
Martha's Vineyard (1963). Martha's Vineyard is a small island off the coast 
of New England that is part of the state of Massachusetts. Whereas 
traditionally, residents were farmers and fishermen, in recent decades, the 
island has become a popular summer resort. The addition of some 40,000 summer 
residents to the year-round population of 5-6000 has, of course, had profound 
consequences for the island's economy. 

Labov chose to study production of two diphthongs, [ai] and [au], both of 
which had lowered historically from the forms [əi] and [əu]. These historical 
changes were not concurrent; [au] had lowered well before the settlement of 
Martha's Vineyard by English speakers in 1642; [ai] lowered somewhat after its 
settlement. 

Labov found a systematically increasing tendency to centralize the first 
vowel of the diphthongs—that is, to reverse the direction of sound change 
just described— in younger native residents when he compared speakers ranging 
in age from 30 upwards. The tendency to centralize the vowels was strongest 
among people such as farmers whose livelihoods had been most threatened by the 
summer residents. (The summer residents have driven up land prices as well as 
the costs of transporting supplies to the island and products to the 
mainland.) In addition, the tendency to centralize was correlated with the 
speaker's tendency to express resistance to the increasing encroachment of 
summer residents on the island. Among the youngest group studied, 
15-year-olds, the tendency to centralize the vowels depended strongly on the 
individual's future plans. Those intending to stay on the island showed a 
markedly stronger tendency to centralize the diphthongs than those intending 
to leave the island to make a living on the mainland. Labov interpreted these 
trends as a disposition among many native islanders to distinguish themselves 
as a group from the summer residents. 

I find these data and others collected by Labov and his colleagues quite 
remarkable in the evidence they provide for listeners' responsiveness to 
phonetic variables they detect in conversation. In natural conversational 
settings, talkers use phonetic variation to psychological and social ends; 
and, necessarily given that, listeners are sensitive to these uses. 

What Enables Phonetically-Structured Vocal-Tract Activity to Do Linguistic 
Work and How Is That Work Apprehended? 

Confronted with language perception and use, an event theory faces 
powerful challenges. Gibson's theory of perception (1966, 1979) depends on a 
necessary relation between structure in informational media and properties of 
events. Obviously, physical law relating vocal tract activities to acoustic 
consequences satisfies that requirement well. But how is the relation between 
word and referent, and, therefore, between acoustic signal and referent to be 
handled? These relations are not universal; that is, different languages use 
different words to convey similar concepts. Accordingly, in one sense, they 
are not necessary and not, apparently, governed by physical law. 

I have very little to offer concerning an event perspective on linguistic 
events (but see Verbrugge, 1985), and what I do have to say, I consider very 
tentative indeed. However, I would like to address two issues concerning the 
relation of speech to language. Stated as a question, the first issue is: 
what allows phonetically-structured vocal-tract activity to serve as a 
meaningful message? The second asks: can speech qua linguistic message be 
directly perceived? 

As to the first question, Fodor (197^) observes that there are two types 
of answer that can be provided to questions of the form: "What makes X a Y?" 
He calls one type of answer the "causal story" and the other the "conceptual 
story." To use Fodor's example, in answer to the question: "What makes 
Wheaties the Breakfast of Champions?", one can invoke causal properties of the 
breakfast cereal, Wheaties, that turn nonchampions who eat Wheaties into 
champions. Alternatively, one can make the observation that disproportionate 
numbers of champions eat Wheaties. As Fodor points out, these explanations 
are distinct and not necessarily competing. 

In reference to the question, what makes phonetically-structured 
vocal-tract activity phonological (that is, what makes it serve a linguistic 
function), one can refer to the private linguistic competences of speakers and 
hearers that allow them to control their vocal tracts so as to produce 
gestures having linguistic significance. Alternatively, one can refer to 
properties of the language user's "ecological niche" that support linguistic 
communication. Vocal-tract activity can only constitute a linguistic message 
in a setting in which, historically, appropriately constrained vocal-tract 
activity has done linguistic work. A listener's ability to extract a 
linguistic message from vocal-tract activity may be given a "conceptual" (I 
would say "functional") account along lines such as the following: Listeners 
apprehend the linguistic work that the phonetically-structured vocal-tract 
activity is doing by virtue of their sensitivity to the historical and social 
context of constraint in which the activity is performed. 

According to Fodor: 

Psychologists are typically in the business of supplying theories 
about the events that causally mediate the production of 
behavior...and cognitive psychologists are typically in the business 
of supplying theories about the events that causally mediate 
intelligent behavior. (p. 9) 

He is correct; yet there is a functional story to be told, and I think that it 
is an account that event theorists will want to develop. 

As to the second question, whether a linguistic message can be said to be 
perceived in a theory of perception from a direct-realist perspective, 
(direct) perception depends on a necessary relation between structure in 
informational media and its distal source. But as previously noted, this does 
not appear to apply to the relation between sign and significance. 

Gibson suggests that linguistic communications, and symbols generally, 
are perceived (rather than being apprehended by cognitive processes), but 
indirectly. His use of the qualifier "indirect" requires careful attention: 

Now consider perception at second hand, or vicarious perception; 
perception mediated by communications and dependent on the "medium" 
of communication, like speech sound, painting, writing or sculpture. 
The perception is indirect since the information has been presented 
by the speaker, painter, writer or sculptor, and has been selected 
by him from the unlimited realm of available information. This kind 
of apprehension is complicated by the fact that direct perception of 
sounds or surfaces occurs along with the indirect perception. The 
sign is often noticed along with what is signified. Nevertheless, 
however complicated, the outcome is that one man can metaphorically 
see through the eyes of another. (1976/1982, p. 412) 

By indirect, then, Gibson does not mean requiring cognitive mediation, 
but rather, perceiving information about events that have been packaged in a 
tiered fashion, where the upper tiers are structured by another 
perceiver/actor. 

What is the difference for the perception of events that have a level of 
indirect as well as of direct specification? I do not see any fundamental 
difference in the manner in which perception occurs, although what is 
perceived is different. (That is, when I look at a table, I see it. When I 
hear a linguistic communication about a table, I perceive selected information 
about tables, not tables themselves.) 

When an event is perceived directly, it is perceived by extraction of 
information for the event from informational media. When a linguistic 
communication is indirectly perceived, information for the talker's 
vocal-tract activities is extracted from an acoustic signal. The vocal-tract 
activity (by hypothesis) constitutes phonetically-structured words organized 
into grammatical sequences, and thereby indirectly specifies whatever the 
utterance is about. 

It is worth emphasizing that the relation between an utterance (uttered 
in an appropriate setting) and what it signifies is necessary in an important 
sense. The necessity is not due to physical law directly, but to cultural 
constraints having evolved over generations of language use. These 
constraints are necessary in that anyone participating in the culture who 
communicates linguistically with members of the speech community must abide by 
them to provide information to listeners and must be sensitive to them to 
understand the speech of others. 

Indeed, in view of this necessity, it seems possible that the distinction 
between direct and indirect perception could be dispensed with in this 
connection. Both the phonetically-structured vocal-tract activity and the 
linguistic information (i.e., the information that the talker is discussing 
tables, for example) are directly perceived (by hypothesis) by the extraction 
of invariant information from the acoustic signal, although the origin of the 
information is, in a sense, different. That for phonetic structure is 
provided by coordinated relations among articulators; that for the linguistic 
message is provided by constraints on those relations reflecting the cultural 
context of constraint mentioned earlier. What is "indirect" is apprehension 
of the table itself, which is not directly experienced; rather, the talker's 
perspective on it is perceived. Therefore, it seems, nothing is indirectly 
perceived. 

I have attempted to minimize the differences between direct and 
"indirect" perception. However, there is a difference in the reliability with 
which information is conveyed. It seems that this must have to do with 
another sort of mediation involved in linguistic communications. As already 
noted, in linguistic communications the information is packaged into its 
grammatically structured form by a talker and not by a lawful relation between 
an event and an informational medium. And as noted much earlier, talkers make 
choices concerning what the listener already knows and what he or she needs to 
be told explicitly. Talkers may guess wrong. Alternatively, they may not 
know exactly what they are trying to say and therefore may not provide useful 
information. For their part, listeners, knowing that talkers are not entirely 
to be trusted to tell them what they need to know, may depend relatively more 
on extra-perceptual guesses. 

Footnotes 

¹It may be useful to be explicit about the relationships among some 
concepts I will be referring to. Events are the primitive components of an 
"ecological" science, that is, of a study of actor/perceivers in contexts that 
preserve essential properties of their econiches. In the view of many 
theorists who engage in such studies (see, for example, the quotation from 
Shaw et al. above), the only viable version of a perceptual theory that can be 
developed within this domain is one that adopts a direct-realist perspective. 
I will take this as essential to the event (or ecological) approach, although, 
imaginably, a theory of the perception of natural events might be proposed 
from a different point of view. 

²There are fundamental similarities between the view of speech perception 
from a direct-realist perspective and from the perspective of the motor 
theory. An important one is that both theories hold that the listener's 
percept corresponds to the talker's phonetic message, and that the message is 
best characterized in articulatory terms. 

There are differences as well. As Liberman and Mattingly (1985) note, 
one salient difference is that the direct-realist theory holds that the 
acoustic signal is, in a sense, transparent to the perceived components of 
speech, while the motor theory does not. According to the motor theory, 
achievement of a phonetic percept requires special computations on the signal 
that take into account both the physiological-anatomical and the phonetic 
constraints on the activities of the articulators. A second difference is 
more subtle and perhaps will disappear as the theories evolve. Liberman and 
Mattingly propose that the objects of speech perception (at the level of 
description under consideration) are the "control structures" for observed 
articulatory gestures. Due to coarticulatory smearing, these control 
structures are not entirely redundant with the collection of gestures as they 
occur. My own view is that the smearing is only apparent, and, hence, the 
control structures are wholly redundant with the collections of articulatory 
gestures (properly described) constituting speech. 

³This characterization may appear patently incorrect in cases where the 
same articulator is involved simultaneously in the production of more than one 
phonetic segment (for example, the tongue body during closure for [kh] in 
"key" and "coo" and the jaw during closure for [b] in "bee" and "boo"). 
However, Saltzman and Kelso (Saltzman, in press; Saltzman & Kelso, 1983) have 
begun to model this as overlapping but separate demands of different control 
structures on the same articulator, and my own findings on perceived 
segmentation of speech (Fowler, 1984; see also Fowler & Smith, 1986) suggest 
that perceivers extract exactly that kind of parsing of the speech signal. 

⁴Javkin (1976) has provided evidence for the opposite kind of error. In 
his research, listeners heard vowels as longer before voiced than voiceless 
consonants, perhaps because of the continuation of voicing during the 
consonant. 


THE DYNAMICAL PERSPECTIVE ON SPEECH PRODUCTION: DATA AND THEORY* 



J. A. S. Kelso,† E. L. Saltzman, and B. Tuller† 



Abstract. Presented here, in preliminary form, is a general 
theoretical framework that seeks to characterize the lawful 
regularities in articulatory pattern that occur when people speak. 
A fundamental construct of the framework is the coordinative 
structure, an ensemble of articulators that functions cooperatively 
as a single task-specific unit. Direct evidence for coordinative 
structures in speech is presented and a control scheme that realizes 
both the contextually-varying and invariant character of their 
operation is outlined. Importantly, the space-time behavior of a 
given articulatory gesture is viewed as the outcome of the system's 
dynamic parameterization, and the orchestration among gestures is 
captured in terms of intergestural phase information. Thus, both 
time and timing are deemed to be intrinsic consequences of the 
system's dynamical organization. The implications of this analysis 
for certain theoretical issues in coarticulation raised by Fowler 
(1980) receive a speculative, but empirically testable, treatment. 
Building on the existence of phase stabilities in speech and other 
biologically significant activities, we also offer an account of 
change in articulatory patterns that is based on the nonequilibrium 
phase transitions treated by the field of synergetics. Rate scaling 
studies in speech and bimanual activities are shown to be consistent 
with a synergetic interpretation and suggest a principled 
decomposition of languages. The CV syllable, for example, is 
observed to represent a stable articulatory configuration in 
space-time, thus rationalizing the presence of the CV as a 
phonological form in all languages (e.g., Clements & Keyser, 1983). 
The uniqueness of the present scheme is that stability and change of 
speech action patterns are seen as different manifestations of the 
same underlying dynamical principles; the phenomenon observed 
depends on which region of the parameter space the system occupies. 
Though probably wrong, ambitious, and the outcome of much idle 
speculation, the simplicity of the present scheme is attractive and 
may offer certain unifying themes for the traditionally disparate 
disciplines of linguistics, phonetics, and speech motor control. 

Prologue 

The present paper represents, in part, a program of research that seeks 
to understand the lawful regularities that occur in articulatory patterns when 
people speak. The term dynamical in the title should not be interpreted as 
pure biomechanics. Rather, dynamics is used here in the fashion of Maxwell 
(1877), a forerunner of modern treatments of dynamical systems; namely, as the 
simplest and most abstract description of the forms of motion produced by a 
system. In a complex system like that of speech production, it is clearly 
impossible to investigate the behavior of each microscopic degree of freedom, 
however one defines them. The challenge of a dynamical approach is to 
identify and then lawfully relate macroscopic parameters (that operate on slow 
time scales) to the behavioral interactions among more microscopic 
articulatory components (that operate on faster time scales). A putative, but 
important advantage of a dynamical approach that, in principle, may allow for 
a unification of linguistics, phonetics, and speech motor control is the 
level-independent nature of dynamical description. Thus, dynamics can be 
specified at a global abstract level for a system of articulators as well as 
at the local, concrete level of muscle-joint behavior. The issue then becomes 
less one of translating a "timeless" symbolic representation into space-time 
articulatory behavior, as it is one of relating dynamics that operate on 
different intrinsic time scales. Obviously, this is only a way of posing the 
problem, but we believe — in the absence of evidence to persuade us 
otherwise — that it offers a principled solution. 

1. Introduction 

When a speaker produces a word, it is well known that the physical 
description of the word (whether acoustic, as displayed in a spectrogram or 
waveform, or articulatory, as in a cineradiographic sequence) varies widely 
with many factors. Among these are the rate at which the speaker talks, the 
word's pattern of syllabic stress or emphasis within an utterance, and the 
phonetic structure of surrounding words. The variations that arise as a 
consequence of such factors have long resisted unified systematic 
descriptions. Despite intensive research efforts, no one has sufficiently 
described either the acoustic or articulatory information that serves to 
specify a word in all its various contexts. Nor has anyone, to our knowledge, 
identified a canonical shape for a word and then transformed it, in a 
principled fashion, into the many other shapes that it may take. 

Along with colleagues at Haskins Laboratories (e.g., Browman, Goldstein, 
Kelso, Rubin, & Saltzman, 1984) we are developing a solution to this problem 
by treating the units of language — conventionally described by linguists and 
phoneticians as, for example, phonemes, syllables, and words — as the product 
of time-invariant control structures for a system of vocal tract articulators. 
We assume that it is from the properties of dynamically specified control 
structures, to be described presently, that the observed physical variations 
naturally arise. 

A central hypothesis derived from the present theoretical framework is 
that articulators seldom move in an isolated, independent fashion 
(cf. Bernstein, 1967).* In speech production, they are coordinated with one 
another in such a way that changes over time in vocal tract shape are 
produced. Such changes in vocal tract shape structure the sounds of speech 
for a listener. A central problem for the theory, then, which we shall 
address in the present paper, becomes one of characterizing interarticulator 
cooperation in a multidegree of freedom system, and identifying the 
"significant informational units of action" (Greene, 1971, p. xviii) for 
speech. We and others have provided theoretical and empirical support for the 
hypothesis that, for skilled movements of the limbs or speech articulators, 
such action units (or coordinative structures) do not entail rigid or 
hardwired control of joint and/or muscle variables (e.g., Fowler, 1977; Kugler, 
Kelso, & Turvey, 1980; Turvey, 1977; for recent reviews see Kelso, in press; 
Kelso & Tuller, 1984a). Rather, they are defined more abstractly in a 
task-specific manner, and serve to marshal the articulators temporarily and 
flexibly into functional groupings or ensembles of joints and muscles that can 
accomplish particular goals. But what principles govern the assembly of 
coordinative structures for speech and how can such structures be explicitly 
modeled? 

In Section 2 of the present paper we present evidence of coordinative 
structures in speech and discuss how they might be used in the production of 
single syllable utterances. Our focus is primarily on the task-specific 
stability of coordinative structures in the face of either 
experimentally-induced mechanical perturbations, or "natural" perturbations 
that might occur during ongoing speech as a result of contextual variations. 
A key feature of the model we are developing, termed task dynamics (e.g., 
Saltzman & Kelso, 1983/in press) is that it allows one to define invariant 
control structures for specific vocal tract gestural goals, from which 
contextually varying patterns of articulatory trajectories arise. These 
structures are invariant in two ways, both qualitatively, in terms of the set 
of relations among dynamic parameters (analogous to mass, stiffness, damping, 
etc.), and quantitatively, in terms of the parameter values themselves. 

Speaking a word entails laryngeal and supralaryngeal gestures involving 
coordinated activity of many articulators. But words are not simply strings 
of individual gestures, produced one after the other; rather, each is a 
particular pattern of gestures, orchestrated appropriately in space and time. 
A way to probe the nature of the underlying ordering process is to induce 
naturally occurring scaling transformations, such as changes in speaking rate 
and degree of prosodic stress, and search for those aspects of the 
articulatory pattern that remain stable across these transformations. In 
Section 3 of the present paper we reconceptualize and perform an extensive 
reanalysis of earlier work that showed that the relative timing among 
articulators was a crucial feature of intergestural coordination. These steps 
point to the importance of phase as a key dependent variable, a finding that 
has empirical and theoretical implications for understanding both the 
stability of the spatiotemporal orchestration among gestures and the dynamical 
control structures that underlie such patterns (Kelso & Tuller, 1983a, 1985b). 
We outline a dynamical account of speech production that differs radically 
from views that characterize speech as a planned sequence of static 
linguistic/symbolic units that are different in kind from the physical 
processes involved in the execution of such a plan. Rather, we hypothesize 
that the coordinative structures for speech are dynamically defined in a 
unitary way across both abstract "planning" and concrete articulatory 
"production" levels. These units are not timeless, but rather incorporate 
time in an intrinsic manner (cf. Bell-Berti & Harris, 1981; Fowler, 1980). 

In the final section we discuss new directions, experimental and 
theoretical, for enlarging our understanding of the subtleties of dynamical 
structure that underlie changes in critically scaled articulatory patterns. 
We speculate that the form of such changes may in fact offer a window into, 
and perhaps even rationalize, the basic units of phonological analysis. 

2. On Coordinative Structures in Speech 
2.1 Theory and Data 

The production of a single syllable requires the cooperation among a 
large number of neuromuscular components at respiratory, laryngeal, and 
supralaryngeal levels, operating on different time-scales. Yet somehow from 
this huge dimensionality the sound emerges as a distinctive and well-formed 
pattern. How this "compression" occurs (from a microscopic basis of huge 
dimensionality to a low-dimensional macroscopic description) is central to 
many realms of science, not only to understanding the coordination among 
speech articulators (see e.g., Kelso & Scholz, 1985). For example, there are 
many neurons, neuronal connections, metabolic components, muscles, motor 
units, etc. involved in pointing a finger at a target, yet the action itself 
is nicely modeled by a mass-spring system, a point attractor dynamic in which 
all system trajectories converge asymptotically at the desired target (e.g., 
Cooke, 1980; Kelso, 1977; Polit & Bizzi, 1978; Schmidt & McGown, 1980). 
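
The point attractor property just described can be illustrated with a short numerical sketch. It is offered only as an illustration and is not taken from any of the cited studies: the function name, the parameter values (stiffness, mass, critical damping), and the integration scheme are all assumptions chosen for convenience. The sketch integrates a damped mass-spring equation, m*x'' = -b*x' - k*(x - target), from several different initial conditions and reports where each trajectory ends up.

```python
import math


def simulate_mass_spring(x0, v0, target, stiffness=40.0, mass=1.0,
                         damping=None, dt=0.001, duration=3.0):
    """Integrate m*x'' = -b*x' - k*(x - target) and return final (x, v).

    All parameter values are illustrative assumptions, not estimates
    from data. If no damping is given, critical damping is used so the
    trajectory approaches the target without oscillating.
    """
    if damping is None:
        damping = 2.0 * math.sqrt(stiffness * mass)  # critical damping
    x, v = x0, v0
    steps = int(round(duration / dt))
    for _ in range(steps):
        # Restoring force pulls toward the target; damping opposes velocity.
        accel = (-damping * v - stiffness * (x - target)) / mass
        # Semi-implicit Euler: update velocity first, then position.
        v += accel * dt
        x += v * dt
    return x, v


if __name__ == "__main__":
    target = 1.0
    # Three arbitrary starting positions and velocities.
    for x0, v0 in [(-2.0, 0.0), (0.0, 5.0), (3.0, -1.0)]:
        xf, vf = simulate_mass_spring(x0, v0, target)
        print(f"start x={x0:+.1f}, v={v0:+.1f} -> final x={xf:.6f}, v={vf:.6f}")
```

In this sketch the final position depends on the target parameter rather than on where or how fast the system started, which is the sense in which a single parameter setting can stand in for many microscopic details.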

Is it, in fact, the case in speech that the higher dimensionality 
available actually reduces to a lower-dimensional, controllable system? If 
so, on what principles does such compression or reduction of the many degrees 
of freedom rest? These questions amount to a basic problem in the control of 
complex systems, that is, determining the circumstances under which a small 
set of control parameters (K) can effectively manipulate a much larger number 
of degrees of freedom (N). As Rosen (1980) notes, it is usually the case that 
K << N, so that unless further constraints are imposed, it is not possible to 
impose arbitrary controls on N degrees of freedom. 

But what form do such constraints take? Is there any evidence that the 
many degrees of freedom are actually constrained in a systematic fashion when 
a person talks? In earlier work, Fowler (1977, 1980) has described some 
mostly indirect evidence that the many neuromuscular components involved in 
speech do, in fact, cooperate to form functionally-specific action units, or 
as we prefer to call them, coordinative structures (e.g., Turvey, 1977). Here 
we supply more direct experimental support. 



Support for the hypothesis that a group of relatively independent muscles 
and joints forms a single functional unit would be obtained if it were shown 
that a challenge or perturbation to one or more members of the group was, 
during the course of activity, responded to by other remote (non-mechanically 
linked) members of the group. We have recently found that speech articulators 
(lips, tongue, jaw) produce functionally specific, near-immediate compensation 
to unexpected perturbation, on the first occurrence, at sites remote from the 
locus of perturbation (Kelso, Tuller, & Fowler, 1982; Kelso, Tuller, 
V.-Bateson, & Fowler, 1984). The responses observed were specific to the 
actual speech act being performed: for example, when the jaw was suddenly 
perturbed while moving toward the final /b/ closure in /baeb/, the lips 
compensated so as to produce the /b/, but no compensation was seen in the 
tongue. Conversely, the same perturbation applied during the utterance /baez/ 
evoked rapid and increased tongue muscle activity (appropriate for achieving a 
tongue-palate configuration for the final fricative sound) but no active lip 
compensation. 

In order to explore the microscopic workings of a coordinative structure, 
recent work has also varied the phase of the jaw perturbation during bilabial 
consonant production. Remote reactions in the upper lip were observed only 
when the jaw was perturbed during the closing phase of the motion, that is, 
when the reactions were necessary to preserve the identity of the spoken 
utterance (see also Munhall & Kelso, 1985). Thus, the form of cooperation 
observed is not rigid or "hard wired": the unitary process is flexibly 
assembled to perform specific functions (for additional evidence in speech and 
other activities, see Abbs, Gracco, & Cole, 1984; Berkenblit, Fel'dman, & 
Fukson, in press; Kelso et al., 1984). Elsewhere we have drawn parallels 
between these findings and brain function in general (Kelso & Tuller, 1984a). 
Just as groups of cells, not single cells, appear to be the main units of 
selection in higher brain function (Edelman & Mountcastle, 1978), so too 
task-specific ensembles of neuromuscular elements appear to be significant 
units of control and coordination of action, including speech. 

To propose the coordinative structure as a fundamental unit of action 
does not just involve a change in terminology. Its purpose is to take us away 
from the hard-wired language of reflexes and central pattern generators (CPG) 
or the hard-algorithmed language of computers (formal machines), which is the 
source of the motor program/CPG idea. Reflexes and CPGs may be viewed as
elemental entities, but they are not fundamental, in terms of affording an
understanding of coherent action. The fact that we observe functionally
specific forms of cooperative behavior in many different creatures (e.g., the
wiping behavior of the spinal frog; Berkenblit et al., in press; Fukson,
Berkenblit, & Fel'dman, 1980) with vastly different neuroanatomies suggests
that there may be nothing special, a priori, about neural structures and their
"wiring" that mandates the existence of coordinative structures. Rather, it
suggests that the functional cooperativity, not the neural mechanism per se,
is fundamental. Although
neural processes serve to instantiate such functions and support such 
cooperative behaviors, it is the lawful dynamical (rather than neural) basis 
of these cooperative phenomena that is our primary theoretical and 
experimental concern. This is where we part company with certain current 
views of motor control. Contrary to the motor programming formulation that
relies on symbol-string manipulation familiar to computer technology (and, we
would add, the whole "information processing" perspective of cognitive science
[Carello, Turvey, Kugler, & Shaw, 1984]), the construct of coordinative
structures highlights both the analytic tools of qualitative (nonlinear)
dynamics (e.g., Kelso, Holt, Rubin, & Kugler, 1981; Kelso, V.-Bateson,
Saltzman, & Kay, 1985; Saltzman & Kelso, 1983/in press), which provide
low-dimensional descriptions of forms of motion produced by high-dimensional
systems, and the physical principles of cooperative phenomena (e.g., Haken,
1975; Haken, Kelso, & Bunz, 1985; Kelso, 1981; Kelso & Tuller, 1984a, 1984b;
Kugler, et al., 1980; Kugler, Kelso, & Turvey, 1982), which account for the 
emergence of order and regularity in nonequilibrium, open systems. Though 
preliminary, both approaches will be apparent below and in following sections. 

2.2 Task Dynamic Model

One way of trying to understand the operation of a coordinative structure
is to model it. What type of model could generate, in a task-specific manner,
the trajectories characteristic of normal unperturbed speech gestures and the
spontaneous, compensatory behaviors discussed above? Here we discuss briefly
how these issues of multiarticulator coordination within single speech
gestures are treated in a task-dynamic model (Saltzman, 1985/in press;
Saltzman & Kelso, 1983/in press) recently developed for effector systems
having many articulatory degrees of freedom. Finally, we describe some
preliminary attempts to model multiarticulator coordination within two 
temporally overlapping speech gestures, with reference to "naturally" induced 
compensatory behaviors (i.e., coarticulation) . 

Task dynamics is able to model the phenomenon of immediate compensation
without requiring explicit trajectory planning or replanning (see Saltzman &
Kelso, 1983/in press, for further details). Note that defining invariant
patterns of dynamic parameters at the level of articulatory degrees of freedom
(e.g., stiffness and damping parameters for the jaw and lips) will not suffice
to generate these behaviors. The immediate compensation data for speech
described above (Kelso et al., 1984) could not be generated by a system with a
constant rest configuration parameter (i.e., a vector whose components are
constant rest positions for the lips and jaw; cf. Lindblom, 1967). As shown
in these data, when sustained perturbations were introduced during
articulatory closing gestures, the system "automatically" achieved the same
constriction as for an unperturbed gesture, but with a different final or rest
configuration. Thus immediate compensation appears to result from the way
that dynamic parameters at the articulatory level are constrained to change
during a gesture in a context-dependent manner. In the task-dynamic model,
such patterns of constraint originate in corresponding invariant patterns of
dynamic parameters at an abstract, functionally defined level of task
description.

There are three main steps involved in simulating coordinated movements 
of the speech articulators using the task-dynamic model. Since simulations to
date have focused mainly on bilabial gestures, we will describe these three 
steps in some detail with reference to the specific example of a discrete 
bilabial closure task. 

Task space. The first step is to specify the functional aspects of the
given speech gesture with reference to the constriction-forming movements of
an idealized vocal tract. This is done in a two-dimensional task space whose
axes represent constriction location and constriction degree, and the
topological form of the control regime for each task-space variable is
specified according to the functional characteristics of the given speech
task. For example, discrete and repetitive speech gestures will have damped
(e.g., point attractor) and cyclic (e.g., limit cycle) second-order system
dynamics, respectively, along each axis. At the task-space level, then, the
dynamical system or control regime is abstract in that the constriction being
controlled is independent of any particular set of articulators, and can
refer, for example, to either a bilabial constriction produced by the lips and
jaw or to a tongue-palate constriction produced by the tongue and jaw. Since
we have chosen a discrete closure task to illustrate the steps involved in our
task-dynamic simulations, we specify invariant, damped, second-order dynamics
for the articulator-independent constriction along each task axis (see Figure
1a).
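
As a purely illustrative sketch (not the simulation code used for the model),
the damped, second-order, point-attractor regime specified for a single
task-space axis can be integrated numerically as shown here. The function
name, unit mass, stiffness, damping, step size, and starting value are all
hypothetical choices made only for this example; the variable is taken to be
constriction degree in arbitrary units.

```python
import numpy as np

def simulate_point_attractor(x0, target, stiffness, damping,
                             mass=1.0, dt=0.001, duration=1.0):
    """Integrate m*x'' + b*x' + k*(x - target) = 0 from rest.

    Uses semi-implicit Euler steps; returns the position at every step.
    """
    n_steps = int(round(duration / dt))
    x, v = float(x0), 0.0
    trajectory = np.empty(n_steps + 1)
    trajectory[0] = x
    for i in range(n_steps):
        a = (-damping * v - stiffness * (x - target)) / mass
        v += a * dt          # update velocity first
        x += v * dt          # then position, using the new velocity
        trajectory[i + 1] = x
    return trajectory

# Hypothetical parameters: critical damping for unit mass (b = 2*sqrt(k)).
k = 100.0
b = 2.0 * np.sqrt(k)
constriction_degree = simulate_point_attractor(
    x0=10.0, target=0.0, stiffness=k, damping=b)
```

With these assumed values the variable settles onto its target without
oscillating, which is the qualitative behavior wanted for a discrete closure;
a cyclic (limit cycle) regime would require a different equation.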

Figure 1. Bilabial tasks. a. Task space: variables are constriction
location (CL) and constriction degree (CD). Closed circle denotes
current system configuration. Sawtooth represents axis dynamics in
lumped form. b. Body space: tract variables are lip protrusion
(LP) and aperture (LA). UT and LT denote positions of upper and
lower teeth, respectively. c. Model articulator space:
articulator variables are jaw angle (J), upper lip vertical (ULV),
lower lip vertical (LLV), and lip horizontal (LH).



Body space: Tract variables. The second step in modeling bilabial closure
is to transform the task-space system kinematically into a two-dimensional
body-space system defined in the midsagittal plane of the vocal tract. In
contrast to the task-space regime, the body-space dynamics are specific to a
given set of articulators whose movements govern the bilabial constriction
along the tract variable dimensions of lip aperture (LA) and lip protrusion
(LP). These tract variables represent the body-space counterparts of the
task-space variables of constriction degree and location, respectively (see
Figure 1b). Lip aperture is defined by the vertical distance between the
upper and lower lips, and lip protrusion by the horizontal distances in the
anterior-posterior direction of the upper and lower lips from the upper and
lower teeth, respectively. Upper and lower lip protrusion movements are not
independent in our preliminary formulation, but have been constrained to be
equal in the model for purposes of simplicity. Consequently, like
constriction location in task space, lip protrusion in body space currently
constitutes only a single degree of freedom. This constraint may be abandoned
in future work, as we attempt to model gestures in which the upper and lower
lips show very different horizontal positions (e.g., labiodental fricatives).
Finally, it should be noted that the result of transforming from task space to
body space coordinates is to define a two-dimensional set of motion equations
with a constant (although transformed) set of dynamic parameters. The tract
variable control regimes are independent, since their corresponding equations
of motion are uncoupled.

Model articulator space. The third step in modeling the closure task is to
transform kinematically the two-dimensional tract variable regime into the
coordinates of a four-dimensional model articulator space. The model
articulators are moving segments that have lengths but are massless (see
Figure 1c), and are defined with reference to the simplified articulatory
degrees of freedom adopted in the Haskins Laboratories software articulatory
speech synthesizer (Rubin, Baer, & Mermelstein, 1981). For bilabial gestures,
the set of articulator movements associated with lip aperture includes
rotation of the jaw and vertical displacements of the upper lip and lower lip 
relative to the upper and lower front teeth, respectively; for lip protrusion, 
the set of articulator movements includes (currently) yoked horizontal 
displacements in the anterior-posterior direction of the upper and lower lips 
relative to the upper and lower front teeth, respectively. 

Since there are more model articulator variables than tract variables for
the bilabial closure task, the model articulator system is redundant and the
inverse kinematic transform from tract variables to model articulator
coordinates is indeterminate (e.g., Saltzman, 1979). In order to deal with
the indeterminacy or one-to-many property of this transformation, a weighted,
least-squares optimality constraint is introduced in the form of a weighted
Jacobian pseudoinverse transformation. This pseudoinverse has also been used
in control schemes for robot arms that have a surplus number of degrees of
freedom (i.e., the number of joints in the arm is greater than the number of
task-relevant, spatial degrees of freedom for the hand; e.g., Benati, Gaglio,
Morasso, Tagliasco, & Zaccaria, 1980; Klein & Huang, 1983; Whitney, 1972).
Specifically, the pseudoinverse is a function of two matrix components: the
Jacobian and articulator weighting matrices. The Jacobian matrix defines the
transformation that relates motions of the articulators at their current
configuration or posture to corresponding tract-variable motions of the
bilabial constriction. The elements of the Jacobian matrix are nonlinear
functions of the current articulatory posture. The elements of the
articulator weighting matrix, however, are constant during a given gesture.
In current modeling, a given set of articulator weightings constrains the
motion of an articulator in direct proportion to the relative magnitude of the
corresponding weighting element. Hence, different articulator weighting
patterns are associated with different amounts of relative motion on the part
of the four articulators responsible for controlling the tract variables of
the bilabial constriction. In this sense, elements of the articulator
weighting matrix used in the associated pseudoinverse define a further set of
constant parameters for the bilabial constriction's equation of motion.
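
The role of the weighting matrix can be illustrated with a minimal sketch of
a weighted least-squares pseudoinverse, written here in the common form
W^-1 J^T (J W^-1 J^T)^-1. This is offered only as an assumption about the
general technique, not as the model's actual formulation. The constant
Jacobian below is hypothetical (in the model its elements vary with posture),
as are the articulator ordering, the weights, and the tract-variable
velocities.

```python
import numpy as np

def weighted_pseudoinverse(J, weights):
    """Return W^-1 J^T (J W^-1 J^T)^-1 for a diagonal W = diag(weights)."""
    W_inv = np.diag(1.0 / np.asarray(weights, dtype=float))
    return W_inv @ J.T @ np.linalg.inv(J @ W_inv @ J.T)

# Hypothetical constant Jacobian. Columns: jaw, upper lip vertical,
# lower lip vertical, lip horizontal. Rows: lip aperture, lip protrusion.
# A unit of closing motion by any vertical articulator reduces aperture.
J = np.array([[-1.0, -1.0, -1.0, 0.0],
              [ 0.0,  0.0,  0.0, 1.0]])

# Desired tract-variable velocities (arbitrary units): close and protrude.
tract_velocity = np.array([-3.0, 0.5])

# Equal weights share the closing motion equally among jaw and lips.
equal = weighted_pseudoinverse(J, [1, 1, 1, 1]) @ tract_velocity

# A larger jaw weight constrains the jaw; the lips take up the difference.
heavy_jaw = weighted_pseudoinverse(J, [4, 1, 1, 1]) @ tract_velocity
```

In both cases the articulator velocities reproduce the requested
tract-variable velocities exactly; only the distribution of motion across the
redundant articulators differs.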

To summarize, in the task-dynamic model one may interpret the
task-specific, coherent movements of the model articulatory system as
resulting from the way that instantaneous tract-variable "forces" acting on a
particular vocal tract constriction are distributed across the model
articulators during the course of the tract-variable gesture. At any given
instant during this gesture, the partitioning is based on two factors:

a) the task-specific, constant set of articulator weightings and
tract-variable dynamic parameters (e.g., lip aperture stiffness and
damping); and

b) the current values of elements in the posturally dependent Jacobian 
matrix. Because these elements are functions of the current posture 
of the model articulators, the dynamic parameters defined at the 
level of the model articulator variables (e.g., stiffness and damping 
of the jaw, upper lip, and lower lip) are also functions of the 
evolving articulatory configuration. 

Example 1: Discrete bilabial closures: Unperturbed gestures. Given a
fixed set of dynamic parameter values for the tract variables of lip aperture
and lip protrusion, and a set of initial positions and velocities for the jaw,
upper lip, and lower lip, the equations of motion for the model articulators
will generate a pattern of coordinated articulatory movements that will
achieve the task goal (e.g., bilabial closure) specified for the tract
variables. For an initial configuration corresponding to open and relatively
unprotruded lips, and with initial articulator velocities of zero, these
coordinated movements will reflect the evolving task-specific motions of the
tract variables en route to their specified targets, with motion
characteristics (e.g., speed, degree of overshoot, etc.) specified by the
pattern of tract-variable dynamic parameter values. Assuming the system is
not perturbed during its motion trajectory, the relative extents of movement
for the jaw and lips will be specified by the relative values of the
associated articulator weightings. Thus, one weighting pattern might
correspond to predominant jaw motion, while a second weighting pattern might
correspond to predominant vertical motion of the lips for a given lip aperture
trajectory.




Figure 2. Simulated articulator configurations for bilabial closure task. a.
Initial configuration (solid lines); b. Final configuration,
unperturbed trajectory (dotted lines); c. Final configuration,
perturbed trajectory (broken lines). Note that closure is
lower in jaw space in c than in b. J = jaw axis, UT = upper teeth,
UL = upper lip, LT = lower teeth, LL = lower lip.

Figure 2 (configurations a and b) illustrates an unperturbed movement from
an initially open and relatively unprotruded configuration (Figure 2a) to a 
closed and relatively protruded final configuration (Figure 2b). Since the 
articulators associated with lip aperture were weighted equally in the 
corresponding weighting matrix, the extents of motion for these articulators 
were equal over the course of the gesture. 

Example 2: Bilabial closure, immediate compensation, perturbed gestures.
As discussed in Section 2.1, Kelso et al. (1984) demonstrated that if the jaw
was retarded en route to a bilabial closure for /b/, the closure was still
attained and the final articulatory configuration for the perturbed movement
was different from the final configuration for unperturbed movements.
Significantly, upper lip compensation was absent if the jaw was perturbed en
route to an alveolar closure for /z/. These results show that an invariant
dynamic description of a movement does not apply at the articulator level,
since the articulatory-dynamic parameters (e.g., rest-configuration) must be
able to change according to a movement's context in an utterance-specific
(i.e., /b/ vs. /z/) manner. Furthermore, the speed of these compensatory
behaviors suggests that they must occur "automatically" without reference to
traditional stimulus-response reaction-time correction procedures.

The task-dynamic model handles such immediate compensation as follows.
Bilabial closing gestures are simulated as discrete movements toward target
constrictions, using point attractor dynamics for the local tract variables of
lip aperture and protrusion. When the simulated jaw is "frozen" in place
during the closing gesture at the level of the model effector system, the main
qualitative features of the perturbation data are captured, in that: a)
compensation to the jaw perturbation is immediate in the upper and lower lips,
i.e., the system does not require reparameterization in order to compensate;
and b) the target bilabial closure is reached (although with different final
articulator configurations and, hence, different jaw-space locations of the
closure) for both perturbed (Figure 2c) and unperturbed (Figure 2b) trials.
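
A toy sketch may help show why no reparameterization is needed. It assumes a
single tract variable (lip aperture), a constant hypothetical Jacobian in
which each of three articulators (jaw, upper lip, lower lip) closes the
aperture by one unit per unit of motion, equal weightings, and critically
damped point-attractor dynamics. Freezing the jaw is imitated simply by
holding its velocity at zero after a chosen time. None of these numbers or
simplifications comes from the model itself.

```python
import numpy as np

def simulate_closure(freeze_jaw_after=None, la0=10.0, k=100.0,
                     dt=0.001, duration=1.5):
    """Drive lip aperture to zero; optionally clamp the jaw partway through.

    Returns the final aperture and the final closing displacements of
    [jaw, upper lip, lower lip].
    """
    b = 2.0 * np.sqrt(k)                   # critical damping, unit mass
    J = np.array([[-1.0, -1.0, -1.0]])     # hypothetical constant Jacobian
    J_pinv = np.linalg.pinv(J)             # equal-weight pseudoinverse
    q = np.zeros(3)                        # articulator displacements
    dq = np.zeros(3)                       # articulator velocities
    for step in range(int(round(duration / dt))):
        t = step * dt
        la = la0 + (J @ q)[0]              # current aperture
        dla = (J @ dq)[0]                  # current aperture velocity
        dd_la = -b * dla - k * la          # unchanged tract-variable "force"
        dq += (J_pinv * dd_la).ravel() * dt
        if freeze_jaw_after is not None and t >= freeze_jaw_after:
            dq[0] = 0.0                    # the jaw is held in place
        q += dq * dt
    return la0 + (J @ q)[0], q

la_free, q_free = simulate_closure()
la_frozen, q_frozen = simulate_closure(freeze_jaw_after=0.1)
```

Because the tract-variable dynamics keep acting on the actual aperture, the
lips continue to be driven until closure is reached even though the jaw has
stopped, so the same target is attained with a different final articulator
configuration.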

Example 3: Coarticulation, gestural coproduction, bilabial and tongue
dorsum gestures. In the task-dynamic model, coarticulatory effects may
originate in two ways. Passive carryover effects that are due to inherent
system "sluggishness" (i.e., the time constants of the different tract
variables; see also Henke, 1966; Coker, 1976) are implicit in the functioning
of the model. Additionally, and more interestingly, other coarticulatory
effects (both anticipatory and carryover) result from the temporally
overlapping demands (conflicting or synergistic) made by the same or different
tract variables on a common articulator subset (e.g., bilabial and tongue
dorsum gestures with reference to the shared jaw articulator). We have begun
to model these latter "active" coarticulatory effects by using the
articulatory synthesizer to define articulator subsets for two new tract
variables associated with the vocal tract constriction formed by movements of
the tongue body dorsum. Thus, constriction location for tongue body dorsum is
associated with the articulatory degrees of freedom of jaw rotation, and
radial and angular displacements of the tongue body relative to the jaw;
constriction degree for tongue body dorsum is associated with the same
articulatory subset.

In preliminary simulations, we modeled the hypothetical cases in which
bilabial and tongue dorsum gestures either did not overlap in time or were
fully synchronous. Both gestures were identical in durational and damping
factors, and all articulators had equal weightings. For both gesture types in
the nonoverlapping case, the model articulators started at the same initial
"neutral" configuration (corresponding to open lips and a schwa-like
position for the tongue dorsum), and attained their respective bilabial
closure and tongue dorsum constriction targets. Final articulatory
configurations were different for both gesture types; in particular, the
final jaw position for the single bilabial gesture was higher than that for
the single tongue dorsum gesture. Recalling previous discussions in this
section, these final configurational (and jaw positional) differences resulted
from the different ways that the instantaneous, evolving task-space forces
were distributed across each gesture type's articulatory subset during the
course of the movement. Roughly speaking, if we focus on the net "force"
distributed to the jaw during the movement, we can say that more net force was
delivered to the jaw during the simulated bilabial than during the tongue
dorsum gesture, resulting in greater and lesser jaw displacements,
respectively. Starting from the same initial configuration but with
synchronous gestures, both the bilabial and tongue dorsum targets were again
reached. However, the final articulatory configuration was different from
those observed when either of the gestures occurred in isolation. The final
jaw height for the gesturally synchronous case was halfway between the final
jaw positions attained for the nonoverlapping gestures. This compromise jaw
position resulted from the fact that, in the model, the net force delivered to
the jaw over the gesturally synchronous movement was (roughly) the weighted
average of the net jaw forces delivered during each of the nonoverlapping
gestures.

We are extending our simulations currently to include cases in which
different gestures overlap only partially in time (a more realistic assumption
with reference to speech coarticulatory phenomena). In these cases, the net
force distributed to the jaw (and hence total jaw displacement) during periods
of gestural overlap will reflect the weighted averages of the jaw forces
associated with each gesture over these periods. The predicted behavior of
the model is consistent, in fact, with coarticulation data for V1CV2
utterances presented by Sussman, MacNeilage, and Hanson (1973). As a first
approximation toward modeling such utterances, we will treat bilabial
consonants as closing gestures of a lip-jaw system associated with the tract
variables for bilabial constrictions. Similarly, we will treat vowels as
opening gestures of the jaw-tongue system associated with the tract variables
for tongue dorsum constrictions. We realize, of course, that this description
represents only a preliminary, simplified account of the data, which will be
modified as experiments and simulations progress. For example, at least the
early portions of consonantal release gestures appear to depend on the manner
class (e.g., stops vs. fricatives) of the consonants themselves. However,
given these assumptions, we may represent the V1CV2 productions as temporally
overlapping sequences of opening (vocalic) and closing (consonantal) tract
variable gestures. Since the vowel and consonant gestures share the jaw as a
common articulator, the net movement of the jaw during periods of gestural
overlap (i.e., the period of jaw motion during which the V1C closing gesture
overlaps the CV2 opening gestures) will be determined by the weighted average
of the respective "demands" made on the jaw by each gesture during these
periods. Hence, for example, the vertical upward displacement of the jaw for
a V1C gesture (and hence, the jaw height at closure) will be influenced by the
height of V2. Specifically, the net upward demand or "force" delivered to the
jaw for low V2 (/ae/) will be less during the period of gestural overlap than
it would be for high V2 (/i/), and should generate the anticipatory
coarticulatory effect of greater V1C displacement for high V2 than for low V2
observed by Sussman et al. (1973).
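
The weighted-average argument can be made concrete with a small numerical
sketch. The "demand" values and the equal weights below are invented solely
for illustration (positive means upward, negative means downward, arbitrary
units), and the helper name is hypothetical.

```python
def blended_jaw_demand(demands, weights):
    """Weighted average of the jaw demands made by overlapping gestures."""
    total_weight = sum(weights)
    return sum(d * w for d, w in zip(demands, weights)) / total_weight

# Hypothetical demands on the shared jaw during the overlap period.
closing_demand = 1.0          # V1C bilabial closing gesture raises the jaw
opening_low_v2 = -0.8         # a low V2 asks for a large jaw lowering
opening_high_v2 = -0.2        # a high V2 asks for little jaw lowering

net_low = blended_jaw_demand([closing_demand, opening_low_v2], [1.0, 1.0])
net_high = blended_jaw_demand([closing_demand, opening_high_v2], [1.0, 1.0])
```

Under these assumed numbers the net upward demand on the jaw is smaller when
the upcoming vowel is low than when it is high, which is the direction of the
anticipatory effect described above.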

3. On Gestural Orchestration: From Relative Timing to Phase Stability

In the previous section we focused on the intrinsic properties of
functional units of action, but have not discussed the sequencing or
orchestration of these units over time. One way to explore the processes
underlying such orchestration is to transform a given action pattern as a
whole (e.g., by scaling on movement rate, amplitude, etc.) and search for what
remains stable across the transformation. 

Much evidence now exists that the relative timing of movement events is
stable across certain scaling changes and hence provides a more appropriate
metric than their absolute durations. Although early demonstrations of
relative temporal stability were provided from activities that are
qualitatively repetitive and potentially pre-wired (e.g., locomotion,
respiration, and mastication; see Grillner, 1977, for review), more recent
work has revealed that less repetitive activities show similar organizational
features (e.g., two-handed movements, typing, handwriting, postural control
and speech-manual coordination; Hollerbach, 1981; Kelso, Southard, & Goodman,
1979; Kelso, Tuller, & Harris, 1983; Lestienne, 1979; Nashner, 1977; Schmidt,
1982; Shapiro, Zernicke, Gregor, & Diestel, 1981; Viviani & Terzuolo, 1980).
Importantly, there is some limited evidence that the production of speech can
be described by a similar style of organization, and we will now describe this
work in some detail.

In a set of previous experiments (Harris, Tuller, & Kelso, 1986; Tuller &
Kelso, 1984; Tuller, Kelso, & Harris, 1982, 1983), Tuller and colleagues have
shown that, across variations in speaking rate and stress, the timing of
articulatory events associated with consonant production remains stable
relative to the interval between events associated with flanking vowels. 
Consider a very simple, but paradigmatic case in which the latency (in ms) of 
onset of upper lip motion for a medial consonant is measured relative to the 
interval (in ms) between onsets of jaw motion for flanking vowels. 

In Figure 3 (top), we see the particular intervals measured for one token of
the utterance /ba#'pab/, spoken at a conversational rate with primary stress on
the second syllable. The movement data were obtained by recording from
infrared LEDs attached to the subject's lips and jaw. Here, the interval from
V1 to V2 represents the time between onsets of jaw motion for successive
vowels. The interval V1-UL represents the latency of onset of medial
consonant-related movement in the upper lip. These points were obtained from
zero crossings of velocity traces. The main empirical question was: Do the
intervals V1-V2 and V1-UL change in a systematically related way as syllable
stress and speaking rate vary?

Figure 3 (bottom), taken from Tuller and Kelso (1984), plots the latency of
upper lip movement relative to the vowel period for one of the four speakers.
The data were similar for all subjects. The utterance shown, /bapab/, spoken
at two rates and with two stress patterns, illustrates the main result: over
changes in speaking rate and stress, the measured temporal intervals and
articulatory displacements change considerably, but the relative timing is
preserved. The overall relationship can be described by a linear function
defined by two parameters: a positive slope and a nonzero intercept. This
high correlation of two event durations across rate and stress in different
speakers has since been replicated by other investigators (Bladon &
Al-Bamerini, personal communication; Gentil, Harris, Horiguchi, & Honda, 1984;
Linville, 1982; Lubker, 1983; Munhall, in press).
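
For readers who wish to see how such a linear description is obtained, a
least-squares fit of consonant latency against vowel-to-vowel period can be
sketched as shown here. The five data points are made up for the example and
are not measurements from any of the studies cited; only the form of the
analysis (slope, intercept, and correlation) is the point.

```python
import numpy as np

# Made-up illustrative values in ms; NOT data from the studies cited.
vowel_period = np.array([180.0, 220.0, 260.0, 300.0, 340.0])
lip_latency = np.array([112.0, 128.0, 152.0, 168.0, 192.0])

# Straight-line fit: latency = slope * period + intercept.
slope, intercept = np.polyfit(vowel_period, lip_latency, 1)

# Pearson correlation between the two intervals.
r = np.corrcoef(vowel_period, lip_latency)[0, 1]
```

A positive slope together with a nonzero intercept means that latency grows
with the vowel period but is not a fixed proportion of it.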

Figure 3. Top. Movements of the jaw, upper lip, and lower lip corrected for
jaw movement, and the acoustic signal, for one token of /ba#'pab/.
Articulator position (y-axis) is shown as a function of time.
Onsets of jaw and lip movements are indicated (empirically
determined from zero crossings in the velocity records). Bottom.
Timing of upper lip lowering associated with /p/ production as a
function of the period between successive jaw lowerings for the
flanking vowels for one subject's productions of /ba#pab/. (t)
Slow rate, first syllable stressed; (0) Slow rate, second syllable
stressed; (A) Fast rate, first syllable stressed; (A) Fast rate,
second syllable stressed.

How is this stability of relative timing to be rationalized? A popular
view in the motor control literature is that time is metered out by a central
program that instructs or commands the articulators when to move, how far to
move, and for how long (e.g., Schmidt, 1982). However, a reconceptualization
by Kelso and Tuller (1985a/in press) and subsequent reanalysis of the original
data (Tuller & Kelso, 1984) strongly suggest that their findings can be
understood without recourse to an extrinsic timer or timing metric. In
fact, a very different view of articulatory "timing" emerges when the 
articulatory movements are reanalyzed as trajectories on the phase plane. 
These phase plane trajectories provide a geometric or kinematic description 
that usefully captures the forms of patterned motion produced by the 
articulators. A brief tutorial follows. 

3.1 The Phase Portrait: A Tutorial (cf. Kelso, Tuller, ..., 1986)

All possible system states can be represented in the phase plane, whose
axes are the articulator's position ($x$) and its velocity ($\dot{x}$). As
time varies, the point $P(x, \dot{x})$ describing the motion of the
articulator moves along a certain path on the phase plane. Figure 4
illustrates the mapping from time domain to phase plane trajectories.
Hypothetical jaw and upper lip trajectories (position as a function of time)
are shown for an unstressed /bab/ (Figure 4a, left) and a stressed /bab/
(Figure 4b, left). On the right are shown the corresponding phase plane
trajectories. In this figure and those following we have reversed the typical
orientation of the phase plane so that position is shown on the vertical axis
and velocity on the horizontal axis. Thus, downward movements of the jaw are
displayed as downward movements of the phase path. The vertical crosshair
indicates zero velocity and the horizontal crosshair indicates zero position
(midway between minimum and maximum displacement). As the jaw moves from its
highest to its lowest point (from A to C in Figure 4), velocity increases
(negatively) to a local maximum (B), then decreases to zero when the jaw
changes direction of movement (C). Similarly, as the jaw is raised from the
low vowel /a/ into the following consonant constriction, velocity peaks
approximately midway through the gesture (D), then returns to zero (A).

Phase plane trajectories preserve some important differences between
stressed and unstressed syllables. For example, maximum lowering of the jaw
for the stressed vowel is greater than lowering for the unstressed vowel and 
maximum articulator velocity differs noticeably between these two orbits 
(e.g., Kelso et al., 1985; MacNeilage, Hanson, & Krones, 1970; Stone, 1981; 
Tuller, Harris, & Kelso, 1982). In contrast, the different durations taken to 
traverse the orbit as a function of stress are not represented explicitly in 
this description. That is, although time is implicit and usually recoverable 
from phase plane trajectories, it does not appear explicitly. 

It is possible to transform the Cartesian $(x, \dot{x})$ coordinates into
equivalent polar coordinates, namely, a phase angle,
$\phi = \tan^{-1}[\dot{x}/x]$, and a radial amplitude,
$R = [x^2 + \dot{x}^2]^{1/2}$. These polar coordinates are indicated on the
phase planes shown in Figure 4. The phase angle has been a key (computed)
dependent variable in our re-analysis of interarticulator timing.** It allows
us to rephrase the traditional question of how the lip knows when to begin its
movement for the medial consonant by asking where on the cycle of jaw states
the lip motion for medial consonant production begins. One possibility
is that lip motion begins at the same phase angle of the jaw across different
jaw motion orbits (i.e., across rate and stress). This outcome is not
necessarily entailed, or predicted by, the relative timing results. For
example, Figure 4a through 4c shows three utterances whose vowel-to-vowel
periods and consonant latencies do not change in a linearly related fashion.
Nevertheless, the phase angle at which upper lip motion begins relative to the
cycle of jaw states is identical in the three cases. Thus, the information
for "timing" of a remote articulator (e.g., the upper lip) may not be time
itself, nor absolute position of another articulator (e.g., the jaw), but
rather a relationship defined over the position-velocity state (or, in polar
coordinates, the phase angle) of the other articulator. Although this
conceptualization is intriguing, we want to re-emphasize that it constitutes
an alternative description of the relative timing data set. For example,
Figure 5 illustrates the converse of Figure 4, namely, that two (hypothetical)
utterances with identical vowel-to-vowel periods (P) and consonant latencies
(L) can nonetheless show very different phase angles for upper lip movement
onset. To be specific, the phase angle analysis incorporates the full
trajectory of motion; the relative timing analysis is independent of
trajectory once movement has begun and is based only on the onsets and offsets
of movement events.
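
The phase-angle computation can be sketched numerically as follows. Several
choices here are assumptions made for the illustration rather than details
taken from the analysis: the jaw cycle is idealized as a sinusoid (a simple
mass-spring motion), position and velocity are each normalized by their peak
absolute value over the cycle, the four-quadrant arctangent is used, and the
sign is chosen so that phase advances with time and 180 degrees falls at the
lowest jaw position. The amplitudes, periods, and onset sample are
hypothetical.

```python
import numpy as np

def phase_angle_deg(position, velocity, onset_index):
    """Phase angle (degrees, 0-360) of the jaw at a given sample.

    Position and velocity are normalized by their peak absolute values,
    an assumption made here so that orbits of different size compare.
    """
    x = position / np.max(np.abs(position))
    v = velocity / np.max(np.abs(velocity))
    phi = np.degrees(np.arctan2(-v[onset_index], x[onset_index]))
    return phi % 360.0

def jaw_cycle(amplitude, period, n_samples=1000):
    """One idealized sinusoidal jaw cycle starting at its highest point."""
    t = np.linspace(0.0, period, n_samples, endpoint=False)
    omega = 2.0 * np.pi / period
    return amplitude * np.cos(omega * t), -amplitude * omega * np.sin(omega * t)

# Two hypothetical orbits differing in amplitude and duration.
x_small, v_small = jaw_cycle(amplitude=1.0, period=0.20)
x_large, v_large = jaw_cycle(amplitude=2.5, period=0.35)

# Suppose upper lip motion starts at the same sample of each cycle.
onset = 600
phi_small = phase_angle_deg(x_small, v_small, onset)
phi_large = phase_angle_deg(x_large, v_large, onset)
```

Although the two orbits differ in size and in absolute timing, the computed
phase angle at lip onset is the same for both, which is the kind of invariance
the phase description is meant to expose.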

[Figure 4 panels: Time series (left); Jaw phase plane (right)]

Figure 4. Left: Time series representations of idealized utterances. Right:
Corresponding jaw motions, characterized as a simple mass spring
and displayed on the 'functional' phase plane (i.e., position on
the vertical axis and velocity on the horizontal axis). Parts a,
b, and c represent three tokens with vowel-to-vowel periods (P and
P') and consonant latencies (L and L') that are not linearly
related. Phase position of upper lip movement onset relative to
the jaw cycle is indicated (see text). (From Kelso & Tuller,
1985a/in press)

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Figure 5. Two hypothetical utterances having identical vowel-to-vowel periods 
(P) and consonant (upper lip) latencies (L) but different phase 
angles of upper lip onset. See caption Figure ^. (From Kelso & 
Tuller, I985a/in press) 186 
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Figure 6 showt- :i5otion on t^* phase plane for the first cycle of 
/•ba#bab/ (top) Oi.u /ba#'bab/ (bv: procJuced at a fast rate. Each token 
shown is the firs^ Instance produced of the utterance type. On the left Is 
the entire jaw cycle for each stress pattern; on the right, the jaw cycle is 
reproduced only until the point of onset of upper lip movement downward for 
production of the medial bilabial consonant, as measured from the first 
deviation from zero velocity. The calculated phase angle^ at which upper lip 
motion begins is indicated for each token. Notice that the jaw displacement 
and velocity are both greater for the stressed than the unstressed syllable. 
Nevertheless, upper lip motion begins at essentially the same phase angle for 
both tokens. If upper lip motion began at a phase angle of 180^, it would be 
synchronous with the jaw "turnaround" point. 



[Figure 6 panels: jaw phase plane trajectories. Axis label: velocity.] 

Figure 6. Left: Jaw cycle on the phase plane for the first token produced of 
stressed /ba#b/ (top) and unstressed /b#ab/ (bottom), spoken at a 
fast rate. Right: Jaw cycle until the onset of upper lip lowering 
for the second /b/. (From Kelso & Tuller, 1985a/in press) 

3.2 New Results 

Table 1 shows the mean data and the standard error of the mean for all four 
speakers from the Tuller and Kelso (1984) study. 2 × 2 ANOVAs for each 
utterance type showed no significant main effects of rate or stress or their 
interaction on the phase angle of upper lip onset for medial consonant 
production. For /babab/, Fs(1,27) ranged from .02 to 2.97; for /bapab/, 
Fs(1,30) ranged from .01 to 2.39; for /bawab/, Fs(1,29) ranged from .01 to 
2.80, ps > .1. Although phase angle was invariant across speaking rate and 
stress, Table 1 also shows some differences in phase angle as a function of 
the medial consonant. There is some tendency for upper lip phase for /p/ to 
be smaller than /b/. This result may be consistent with acoustic findings 
that vowels are longer before voiced than voiceless consonants. There is also 
a strong tendency for /w/'s upper lip phase to be greater than the stops. 
However, this result could be artifactual: the movement measures did not 
include a horizontal component (potentially larger for /w/ than /b/ or /p/). 
In addition, our subjects produced /w/ with much smaller and slower upper lip 
movements, making measurement of movement onset more difficult. 

Table 1 

Mean upper lip phase (±SE) relative to vowel-to-vowel jaw trajectory for 
subjects JE, NM, BT, and CH 

Subject  Condition*  /baba/                     /bapa/              /bawa/ 

JE       SS          212 (7.05)                 184 ([illegible])   207 ([illegible]) 
         SU          205 (2.83)                 183 (2.07)          205 (4.43) 
         FS          197 (2.83)                 177 (3.82)          212 (6.08) 
         FU          203 (4.26)                 179 ([illegible])   203 (6.34) 

NM       SS          182 (2.0)                  178 (2.35)          193 (2.19) 
         SU          178 (2.74)                 175 (3.21)          193 (2.19) 
         FS          [illegible] (3.14)         176 (1.73)          197 ([illegible]) 
         FU          183 (3.49)                 172 (2.50)          189 (3.93) 

BT       SS          168 (2.76)                 163 (2.93)          196 (2.91) 
         SU          168 ([illegible])          174 (3.50)          198 (5.65) 
         FS          166 (4.58)                 166 (2.68)          192 (5.23) 
         FU          [illegible] ([illegible])  167 (3.96)          191 (4.11) 

CH       SS          184 (4.38)                 188 (6.45)          203 (5.48) 
         SU          186 (3.93)                 184 (3.01)          208 (3.67) 
         FS          181 (6.38)                 183 (3.94)          207 (5.90) 
         FU          182 (3.16)                 177 (4.08)          196 (4.37) 

*SS = Slow (Normal) speaking rate, first syllable stressed 
 SU = Slow (Normal) speaking rate, first syllable unstressed 
 FS = Fast speaking rate, first syllable stressed 
 FU = Fast speaking rate, first syllable unstressed 

3.3 Empirical and Theoretical Implications 

There are at least two empirical advantages of these phase angle analyses 
over our previous relative timing descriptions. First, in the relative timing 
analysis, the overall correlations across rate and stress conditions were very 
high, but the within-condition slopes tended to vary somewhat. In the phase 
analysis, on the other hand, the mean phase angle is the same across 
conditions. Second, recall that the relative timing data were fitted by 
linear functions described by two parameters. The phase description requires 
only a single parameter and, if nothing else, is the more parsimonious 
description. 

The phase angle conceptualization also offers a number of theoretical 
advantages over the original relative timing analyses. First, once 
articulatory motions are represented geometrically on the phase plane, the 
phase angle serves to normalize duration across speaker, stress, speaking 
rate, etc. Second, these analyses potentially provide a grounding for 
so-called intrinsic timing theories of speech production (e.g., Fowler, 1980; 
Fowler, Rubin, Remez, & Turvey, 1980), since neither absolute nor relative 
durations need be monitored or controlled extrinsically, and no time-keeping 
mechanisms or time controllers are required in this formulation. As with a 
candle (which provides a metric for time by a change in its length) or a water 
clock (where the metric is number of drops), the units of time are defined 
entirely in terms of the dynamical processes involved. Time itself is not a 
fundamental variable, and is not likely to be a possessed, programmed, or 
represented property of the speech production system (Kelso, Tuller, & Harris, 
198i|; Kelso & Tuller, 1985a). As an aside, it has never been clear how the 
speech system could keep track of time, at least peripherally, because there 
is no known afferent basis (such as time receptors) for time-keeping in the 
articulatory structures themselves (Kelso, 1978). On the other hand, an 
informational basis (e.g., in position and velocity sensitivities of muscle 
spindle and joint structures) is a physiological given in the phase angle 
characterization. It might well be the case that certain critical phase 
angles provide information for coordination between articulators (beyond those 
considered here) and/or vocal tract configurations, just as phase angles of 
the leg joints provide coupling information for locomotory coordination (Shik 
& Orlovskii, 1965). Third, as Fowler (1980) notes, a dominant assumption of 
what she calls extrinsic timing theories of coarticulation is that 
phonological segments are considered to be discrete in the sense that their 
boundaries are straight lines perpendicular to the time axis. Yet as is well 
known, discrete segments are not seen by perpendicular cuts of the physical 
records of speech (acoustic, kinematic, physiological measurements) along the 
time axis. In the phase angle analysis, however, no a priori assumptions are 
made regarding the issue of segmentation per se, and the overlap (or 
coproduction) among gestures is captured in a natural way while still 
preserving a separation between consonantal and vocalic events. 






Kelso et al.: Data and Theory 



A final implication of the view presented here is that "segments" or 
phonological units as typically defined by linguists may not be relevant to 
the speech production system. Rather, phonological units might be profitably 
reconceptualized in terms of characteristic interarticulator phase structures 
(see also Browman & Goldstein, in press, for related notions). Note that the 
phase structure description minimizes the mind/body problem for speech 
production by avoiding the translation step between psychological planning 
units and the physical execution of those units. On the other hand, different 
issues are immediately raised— such as whether there are a restricted number 
of stable phase structures (which one might expect if they are to be tagged 
with linguistic descriptors), and if so, why some configurations appear and 
not others. Experimental inroads into these issues can be made in the present 
perspective with a minimum of ad hoc assumptions, and with little resort to a 
priori linguistic categories. 

4. Instabilities: Nonequilibrium Phase Transitions and Phonetic Change 

The phase analysis of simple speech utterances indicated that certain phase 
relations among the articulators remain unaltered across manifold speaker 
characteristics. Such critical phase angles are revealed by the flow of the 
dynamics of the system; they are not externally defined. As Sleigh and Barlow 
(1980) note in their comparative analysis of creatures that use a wide variety 
of propulsive structures for their activities, phase appears to provide 
essential information for stable coupling among the components of the system. 
How, then, do we conceive of the processes underlying change in articulatory 
pattern? What factors mediate the emergence of new (or different) 
spatiotemporal patterns? Such questions are at the heart of a theory of 
pattern generation. Below we offer an interpretation of certain kinds of 
articulatory (and phonetic) change in terms of the non-equilibrium phase 
transitions treated by synergetics (Haken, 1975, 1977; Haken et al., 1985). 
The central aspects of the theoretical model will be introduced briefly using 
an example from hand movements, and an application to speech will follow that 
focuses primarily on the effects of scaling changes in speaking rate, one of 
Stetson's (1928/1951) "great causes of phonetic modification" (p. 67). 
Importantly, the present analysis attests to the further significance of phase 
information — both within the stable and transition regions of the speech 
system's parameter space — in guaranteeing phonetic stability on the one hand 
and promoting phonetic change on the other. Moreover, along with other kinds 
of data (primarily on phonological development) the form of articulatory 
change may help rationalize a particular phonological unit, as a "natural" or 
intrinsically stable unit in the production of speech. 

4.1 A Synergetic Outline: Pattern Formation and Change 

For some years now, we have advocated an approach in which the control and 
coordination of multidegree of freedom speech and limb activities are treated 
in a manner continuous with cooperative phenomena in other physical, chemical, 
and biological systems, that is, as synergetic or dissipative structures 
(e.g., Kelso et al., 1981, 1983; Kelso & Saltzman, 1982; Kugler et al., 1980). 
These are systems — like that for speech production — that are composed of very 
many subsystems. In synergetics, when a certain parameter or combination of 
parameters (generally referred to as "controls") is scaled in sometimes quite 
nonspecific ways (i.e., the control prescription is not highly detailed), 
well-defined spatiotemporal patterns can form. The latter are maintained by a 
continuous flux of energy (or matter) through the system (e.g., Haken, 1975; 
Yates, Marsh, & Iberall, 1972). Although there is pattern formation in the 
nonequilibrium phenomena treated by synergetics (e.g., the hexagonal forms 
produced in the Bénard convection instability, the transition from incoherent 
to coherent light waves in the laser, the oscillating waves and macroscopic 
patterns of various kinds of chemical reaction, etc.), there are, strictly 
speaking, no special mechanisms (like motor programs) that contain or 
represent the pattern before it appears (for further examples see Kelso & 
Tuller, 1984a). 

How pattern formation occurs in these systems can be visualized roughly as 
follows. Imagine an open, dissipative system, one into which energy is 
continuously fed and from which it is continually dissipated. Certain 
configurations—called modes— are more capable of absorbing the energy flow 
than others. At a critical point, a linearized stability analysis reveals 
that the amplitude of these so-called unstable modes grows exponentially, 
whereas the other modes (the so-called "damped" modes) decay. In many 
nonequilibrium systems, close to critical (or bifurcation) points, the number 
of unstable modes can be shown to be much smaller than the number of stable, 
damped modes. In fact, the latter can be completely eliminated 
mathematically, according to Haken's so-called slaving principle, thereby 
allowing a tremendous reduction of the degrees of freedom. For example, in 
the laser (see Haken, 1975), a reduction from 10* degrees of freedom to a 
single degree of freedom has been obtained. 

More formally, the slaving principle states that the amplitudes of the 
damped modes can be expressed by means of a small set of "unstable" mode 
amplitudes (the so-called order parameters). The consequence is that all the 
damped modes follow the order parameters adiabatically, so that the behavior 
of the whole system is then governed by the order parameters alone (see Haken, 
1977/1983, Chapter 7). Watching Bénard convection, for example, one is 
impressed by how the total behavior, at a critical point, is completely 
captured by a macroscopic, modal action. The motions of the many microscopic, 
molecular components are completely irrelevant at this point: a low 
dimensional, macroscopic observable (the order parameter) specifies the 
system's evolving pattern. 
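
The adiabatic following described above can be illustrated with a generic two-variable toy system. This is a sketch under stated assumptions, not a model taken from the sources cited: `u` stands for a slowly growing (unstable) mode amplitude, `s` for a strongly damped mode, and the equations and parameter values are hypothetical. Setting the damped mode's rate of change to zero gives s = u²/gamma, which reduces the two equations to a single equation for the order parameter.

```python
def simulate_full(eps, gamma, u0, s0, dt, steps):
    """Euler integration of the full system:
    du/dt = eps*u - u*s   (slow, unstable mode)
    ds/dt = -gamma*s + u*u (fast, damped mode)
    """
    u, s = u0, s0
    for _ in range(steps):
        du = eps * u - u * s
        ds = -gamma * s + u * u
        u, s = u + dt * du, s + dt * ds
    return u, s


def simulate_reduced(eps, gamma, u0, dt, steps):
    """Order-parameter equation after eliminating s = u*u/gamma."""
    u = u0
    for _ in range(steps):
        u += dt * (eps * u - u ** 3 / gamma)
    return u


eps, gamma = 0.1, 10.0       # gamma >> eps: s relaxes much faster than u
dt, steps = 0.001, 200_000   # 200 time units

u_full, s_full = simulate_full(eps, gamma, 0.01, 0.0, dt, steps)
u_reduced = simulate_reduced(eps, gamma, 0.01, dt, steps)
print(round(u_full, 3), round(u_reduced, 3), round(s_full, 3))
```

With the damped mode decaying a hundred times faster than the order parameter grows, the one-variable description should track the full system closely; the approximation would be expected to degrade as the two time scales approach each other.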

Identifying order parameters, even for many physical and chemical 
systems, is not always an easy matter. Certain guidelines do exist, however, 
which can be used to select viable candidates. A main one is that the order 
parameter changes much more slowly than the subsystems it is said to govern. 
Relative phase fits this criterion quite well, since it is the phasing 
structure of many different activities that is preserved across scalar 
transformations (Section 3). Thus the individual articulatory components 
change quite a bit (kinematically and electromyographically), but the phase 
does not, at least in a given region of the parameter space. 

4.2 Phase Transitions in Movement: An Explicit Example 

Using relative phase as an order parameter, Haken et al. (1985) have 
offered an explicit theoretical model of phase transitions that occur in 
bimanual activity (see Kelso, 1981, 1984). The basic phenomenon is as follows: 
A human subject is asked to cycle his/her fingers at a preferred frequency 
using an out-of-phase, antisymmetrical motion. That is, flexion [extension] 
of one hand is accompanied by extension [flexion] of the other. Under an 
instruction to increase cycling frequency, that is, a systematic rate 
increase, the movements shift abruptly to an in-phase, symmetrical mode 
involving activation of homologous muscle groups. When the transition 
frequency was expressed in units of preferred frequency, the resulting 
dimensionless ratio or critical value was constant for all subjects but one 
(who was not naive and who purposely resisted the transition, although with 
certain energetic consequences, see Kelso, 1984). A frictional resistance to 
movement lowered both preferred and transition frequencies, but did not change 
the critical ratio (~1.33), suggesting the presence of an intrinsic invariant 
metric. 



For present purposes, the main features of the bimanual experiments are: 
(1) the presence of only two stable phase (or "attractor") states between the 
hands (see also Yamanishi, Kawato, & Suzuki, 1980, for further evidence); (2) 
an abrupt transition from one attractor state to the other at a critical, 
intrinsically defined frequency; (3) beyond the transition, only one mode (the 
symmetrical one) is observed; and (4) when the driving frequency is reduced, 
the system does not return to its initially prepared state, that is, it 
remains in the basin of attraction for the symmetrical mode. 

The theoretical strategy employed by Haken et al. (1985) to account for the 
foregoing findings may be worth noting. First, they specified a potential 
function corresponding to the layout of modal attractor states (i.e., the 
stable in-phase and out-of-phase patterns), and showed how that layout was 
altered as a control parameter (driving frequency) was scaled. From the 
behavior of the potential function, they then derived the equations of motion 
for each hand, and a nonlinear coupling structure between the hands. Analytic 
derivations and consequent numerical simulation revealed that if the model 
system was started, or "prepared," in the out-of-phase mode, and driving 
frequency was increased slowly, the system remained in that mode until the 
solution of the coupled equations of motion became unstable. At this point, a 
jump occurred and the only stable stationary solution produced by the system 
corresponded to the in-phase mode (see Haken et al., 1985, for more details). 
Ongoing theoretical (Schöner, Haken, & Kelso, 1986) and empirical (Kelso & 
Scholz, 1985) work has revealed that the nonlinear coupling strength as well 
as fluctuations (both intrinsically generated due to noise in system 
parameters and extrinsically generated due to an added random forcing 
function) play an important role in effecting the modal transitions between 
the hands. Thus, Kelso and Scholz (1985), in new experiments, have found both 
"critical slowing down" and enhanced fluctuations in order parameter behavior 
as the transition is approached. These predictions follow directly from the 
synergetic treatment of nonequilibrium phase transitions (see e.g., Haken, 
1984; Haken et al., 1985; Schöner, Haken, & Kelso, 1986) and are simply not 
part of more conventional accounts of "switching" behavior based on motor 
programs (cf. Schmidt, 1982, p. 316) or central pattern generators 
(cf. Grillner, 1982, p. 22^^). 
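
The qualitative behavior just described (loss of stability of the out-of-phase mode, an abrupt jump, no return when the control parameter is reversed, and slowing of relaxation near the transition) can be sketched with the relative-phase equation usually associated with Haken et al. (1985). The form assumed here, V(phi) = -a cos(phi) - b cos(2 phi) with d(phi)/dt = -dV/d(phi), is not stated in the present text and is given as our recollection of the published model; the ratio b/a is taken to fall as cycling frequency rises, and all numerical values are arbitrary.

```python
import math


def hkb_rate(phi, a, b):
    """d(phi)/dt = -dV/d(phi) for V(phi) = -a cos(phi) - b cos(2 phi)."""
    return -a * math.sin(phi) - 2.0 * b * math.sin(2.0 * phi)


def relax(phi, a, b, dt=0.01, steps=20_000):
    """Let relative phase settle (Euler steps) at fixed a and b."""
    for _ in range(steps):
        phi += dt * hkb_rate(phi, a, b)
    return phi


def distance_from_in_phase_deg(phi):
    """0 for the in-phase mode, 180 for the out-of-phase mode."""
    return abs(math.degrees(math.atan2(math.sin(phi), math.cos(phi))))


def sweep(ratios, phi, a=1.0, nudge=1e-3):
    """Step through b/a values, perturbing slightly before each relaxation."""
    states = []
    for ratio in ratios:
        phi = relax(phi + nudge, a, ratio * a)
        states.append(round(distance_from_in_phase_deg(phi), 1))
    return states, phi


def antiphase_relaxation_time(a, b):
    """Local relaxation time of the out-of-phase state (valid for 4b > a)."""
    return 1.0 / (4.0 * b - a)


falling = [1.0, 0.75, 0.5, 0.3, 0.2, 0.1]   # rate increasing
down_states, phi_end = sweep(falling, math.pi)          # prepared out of phase
up_states, _ = sweep(list(reversed(falling)), phi_end)  # rate decreasing
print(down_states)
print(up_states)
print(antiphase_relaxation_time(1.0, 0.5), antiphase_relaxation_time(1.0, 0.3))
```

In this sketch the out-of-phase state is abandoned once b/a drops below 1/4, and the reverse sweep stays in phase throughout. Linearizing about the out-of-phase state gives a relaxation rate of 4b - a, so the relaxation time grows without bound as the critical ratio is approached, which is one way to picture "critical slowing down."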



4.3 Stetson's (1951) Experiments 

Let us now see how this view of spatiotemporal pattern formation and change 
may apply to speech production. To do this we draw initially on Stetson's 
(1951) work and offer a theoretical interpretation of his experiments that is 
consistent with synergetics. Then we mention some new (as yet preliminary) 
data of our own (Kelso, Munhall, Tuller, & Saltzman, in preparation) 
suggesting that certain kinds of phonetic change correspond directly to phase 
transitions among articulatory gestures. 

Stetson (1951) recognized that "...the modification of the articulations is 
one of the most important aspects for study in experimental phonetics" 
(p. 67), and that scaling changes in speaking rate offered a window into the 
"various types of modification of the factors of the syllable, or the changing 
conditions that throw them together or force them apart" (p. 67). We also 
hypothesize — by analogy to our discussions above — that rate changes may: a) 
reveal the most stable modes of coordination of the articulatory system and, 
in turn b) that these stable modes may rationalize why one phonological form, 
the CV syllable, tends to be a universal feature of all languages 
(cf. Abercrombie, 1967; Bell, 1971; Clements & Keyser, 1983). 

Consider first an example, discussed in some detail by Stetson (1951). A 
subject produces the CVC syllable "pup" repetitively. As speaking rate is 
gradually increased, Stetson describes the following changes: The syllables, 
"pup, pup..." at first distinct, come closer together. As rate increases, 
the arresting consonant of each syllable "doubles" with the releasing 
consonant of the next syllable. Thus the first change can be annotated as: 
"pup, pup..." goes to "pup-pup...." At still higher rates, according to 
Stetson, it becomes impossible to execute the prescribed number of consonants 
per second, and the arresting consonant of each syllable drops out. This 
second change can be referred to as "singling"; "pup-pup..." goes to 
"pu' pu'...." 

Such changes induced by increasing speaking rate are brought about, in 
Stetson's words, by the tendency of movements "either to get into step or to 
drop out in order to simplify the coordination" (p. 71) and, relatedly, 
because of a "universal tendency to simplify by eliminating the arresting 
consonant" (p. 81). But why this particular tendency should prevail is 
unclear. On the one hand, elimination of the arresting consonant "simplifies" 
coordination; on the other, the process is dictated by maximum articulatory 
rates: "singling" must occur at rates of around 4.5 syllables per sec, 
because such rates in turn entail eight consonant movements per sec (Stetson, 
1951). 

"Simplification" as a function of maximum articulatory rate cannot be the 
whole story, of course. For example, often "singling" occurs at a rate as low 
as 2.5 syllables/sec. Also, the arresting consonant does not always drop out; 
often it is said to "fuse" with the releasing consonant (Stetson, 1951). 
Therein lies a potential clue. That is, it may be the phasing among component 
gestures, one with another, that is a central aspect of phonetic stability and 
change (cf. Section 3.0). For example, in his studies of the combination of 
abutting consonants, Stetson notes that the arresting consonant of one 
syllable (e.g., "p" in "sap") and the releasing consonant of the next (i.e., 
"s" in "sap"), "quickly overlap and soon become simultaneous, a striking 
illustration of the movements of speech to get in phase" (p. 78). Moreover, 
as rate decreases, the movements of arresting and releasing consonants "merely 
slide apart" (p. 78). Thus, there is a strong hint in Stetson's experiments 
and writings that under scaling influences of speaking rate, certain phase 
relations among gestures are more stable than others. For example, in all 
"phonetic coordinations," to use Stetson's phrase, there is a preferred 
relation between the releasing consonant and the beat stroke of the syllable. 
Namely, the releasing consonant never drops out. According to Stetson (1951), 
it retains its position because it coincides with the syllable's beat 
stroke. In addition, compound consonants are said to be produced by the 
"sliding" of the two movements, e.g., the continuant labial "m" and the 
continuant lingual "s" in the syllable "mass" slide to form "sma." Stetson's 
descriptive language in this respect is almost prophetic of current 
formalisms: abutting consonants are "attracted" (p. 80) one to the other. As 
part of the tendency for movements to coincide, one consonant movement is 
delayed and the other advanced (cf. Stetson, 1951, p. 80). 

Though phase was never explicitly measured in Stetson's work, and his 
account of phonetic change is largely posed within his "chest pulse" 
framework, there appears to be a strong linkage between his results and our 
previous discussion of hand movements. In particular, it seems possible that 
both may fall under the theoretical rubric of nonequilibrium phase 
transitions. Such a view is supported on at least two grounds. First, in an 
as yet quite limited data set, we have examined interarticulator phasing when 
subjects produced the vowel-consonant combination /ip/ at progressively faster 
rates (Kelso et al., in preparation). A shift to the CV form /pi/ occurred 
at a given rate and was characterized by an abrupt change in the phase 
relation between glottal aperture and lip aperture. Second, it seems possible 
to reinterpret some of Stetson's own data on syllable durations when speaking 
rate is increased, as consistent with our previous theoretical discussion of 
phase transitions. 

4.4 Phase Transitions in Speech: Some Direct Evidence 

The design of the following experiment was extremely simple. Infrared 
light emitting diodes were placed on the subjects' lips and jaw, thus allowing 
us to obtain the trajectories of these articulators. Similarly, the opening 
and closing of the glottis was monitored by transillumination (e.g., Baer, 
Löfqvist, & McGarr, 1983). Similar to some of Stetson's (1951) work, the 
subject was invited to produce the syllable /ip/ at a slow speaking rate and 
then instructed simply to speed up in a step-like manner. A complete trial 
consisted of a series of repetitive syllables produced in a single breath. 
Typically, a trial lasted about 10 to 12 sec. An identical procedure was 
employed for the syllable /pi/. Subjects performed at least five trials per 
syllable. Although data collection and analysis are not yet completed 
(presently three subjects have been run), the data are quite clear thus far. 

Trajectories over time of lip aperture (i.e., a single variable 
representing vertical distance between upper and lower lips) and glottal 
aperture are shown in Figure 7(1b) and 7(2b) for part of a representative 
trial for each utterance. In these subfigures, the vertical ticks denote the 
onset of lip opening (consonant release) and the occurrence of peak glottal 
opening (maximum vocal fold abduction). The corresponding relative phase 
between the lip and glottal aperture motions is shown in Figure 7(1a) and 
7(2a). The movements shown were sampled at 200 Hz (for details of signal 
processing techniques, see Kay, Munhall, V.-Bateson, & Kelso, 1985). 
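
A point estimate of relative phase of the kind plotted in Figure 7 can be sketched as follows. The formula is an assumption (a common convention, not one spelled out in the text): each cycle is bounded by successive reference events (here, lip opening onsets), and the latency of the target event (peak glottal opening) within that cycle is expressed as a fraction of 360 degrees. The event times are synthetic and purely illustrative.

```python
def point_phase_deg(reference_times, target_times):
    """Relative phase of the first target event inside each reference cycle.

    A cycle runs from one reference event to the next; cycles that contain
    no target event are skipped.
    """
    phases = []
    for start, end in zip(reference_times, reference_times[1:]):
        in_cycle = [t for t in target_times if start <= t < end]
        if in_cycle:
            phases.append(360.0 * (in_cycle[0] - start) / (end - start))
    return phases


# Hypothetical lip opening onsets (s) for a trial that speeds up.
lip_onsets = [0.0, 1.0, 1.8, 2.4, 2.8]

# Hypothetical glottal peaks lagging each onset by one eighth of a cycle.
glottal_peaks = [start + 0.125 * (end - start)
                 for start, end in zip(lip_onsets, lip_onsets[1:])]

phases = point_phase_deg(lip_onsets, glottal_peaks)
print([round(p, 1) for p in phases])
```

Because the latency is scaled by the current cycle duration, a constant phase lag yields the same value at every speaking rate, even though the absolute latency in seconds shrinks. Target events that slightly precede the reference event would need a signed variant of this function.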

In the case of both /ip/ and /pi/ it is quite obvious that the phase 
relation between lip opening onset and peak glottal opening is practically 
invariant, but different for the two syllables, over the range of speaking 
rates examined (approximately 1 to 5 syllables/sec). For /pi/, peak glottal 
opening lags the onset of oral opening by a constant amount, roughly 40-45 
degrees. For /ip/ the two events are almost coincident, up to a speaking rate 
of approximately 4 syllables/sec. Then, a clear jump in phase occurs, 
practically within a single cycle, to the phasing pattern for /pi/. Note that, 
like the hand movement data, the phase transition occurs at well below maximum 
syllable rates, at least for CV syllables. Again, both forms of coordination 
are quite stable below the critical region: only the coordinative mode 
characteristic of the CV, however, exists beyond the transition. Except for 
the quantitatively different phase relations observed, these speech data mimic 
the pattern of results observed in the bimanual data. 

Figure 7. Phase relation between glottal and oral aperture (A), and 
trajectories of each variable over time (B) as the subject speeds 
up a given utterance. 1. The instructed utterance is /ip/. The 
arrow denotes a phase shift between glottal and oral aperture that 
occurs when the VC form /ip/ changes to the CV form, /pi/. 2. The 
instructed utterance is /pi/. Note that no phase shift occurs in 
this case even though the speaking rate is comparable to (1). See 
text for further details. ( ) glottal aperture; ( ) lip 
aperture. 

Several issues remain to be addressed, however. First, we need to know 
much more about what goes on in the region of the transition itself. Second, 
a continuous estimate of relative phase should be obtained (see Kelso & 
Scholz, 1985). The point estimate presented here requires the rather 
arbitrary selection of peaks or valleys in the time-series data as reference 
and/or target events. Given previous work on laryngeal-oral coordination 
(e.g., Löfqvist & Yoshioka, 1981), the selection of the present events (i.e., 
lip opening onset and peak glottal opening) seems reasonable. Obviously, 
different events, for example, the onset of movement toward oral closure 
relative to peak glottal opening, would yield the same pattern but different 
phase values. A continuous estimate, based on a sample-by-sample phase 
difference, would not require one to make such a choice a priori. Third, 
Stetson (1951) reports that the original form restores with a decrease in 
rate, that is, the VC form returns and that "...this tendency to restoration 
... is the great conservative factor in pronunciation" (p. 74). We suspect 
otherwise, though we have yet to check our suspicions formally. That is, once 
a transition to the CV form occurs, the system exhibits hysteresis: it tends 
to remain in the currently displayed form. If this is so, then we have a 
model of phase transitions in speech that is formally equivalent to the one 
that Haken, Kelso, and Bunz (1985) developed for phase transitions in hand 
movements. 
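
One way a continuous, sample-by-sample estimate of relative phase might be computed is sketched below. It is an illustration under several assumptions that are ours rather than the text's: each signal's phase is taken as its angle on a phase plane whose axes are scaled by the signal's own peak position and peak velocity, velocity is obtained by central differences, and the two test signals are synthetic sinusoids sampled at 200 Hz with an imposed 45 degree lag.

```python
import math


def phase_series(signal, dt):
    """Sample-by-sample phase angle (radians) of one signal.

    Position is mean-centered; velocity comes from central differences.
    Both are scaled by their peak magnitudes (an assumed normalization).
    """
    mean = sum(signal) / len(signal)
    position = [value - mean for value in signal]
    velocity = [(position[i + 1] - position[i - 1]) / (2.0 * dt)
                for i in range(1, len(position) - 1)]
    position = position[1:-1]  # align with the velocity samples
    p_max = max(abs(p) for p in position)
    v_max = max(abs(v) for v in velocity)
    return [math.atan2(-v / v_max, p / p_max)
            for p, v in zip(position, velocity)]


def relative_phase_deg(lead, lag, dt):
    """Phase of `lead` minus phase of `lag`, wrapped to [-180, 180)."""
    wrapped = []
    for a, b in zip(phase_series(lead, dt), phase_series(lag, dt)):
        diff = (a - b + math.pi) % (2.0 * math.pi) - math.pi
        wrapped.append(math.degrees(diff))
    return wrapped


dt = 1.0 / 200.0                       # 200 Hz sampling
times = [i * dt for i in range(400)]   # 2 s of synthetic data
rate = 3.0                             # hypothetical syllable rate (Hz)
lip = [math.cos(2.0 * math.pi * rate * t) for t in times]
glottis = [math.cos(2.0 * math.pi * rate * t - math.radians(45.0))
           for t in times]

relative = relative_phase_deg(lip, glottis, dt)
print(round(min(relative), 1), round(max(relative), 1))
```

No reference or target events have to be chosen in advance, which is the advantage claimed for a continuous estimate. With real data whose rate and amplitude change within a trial, a single global normalization would probably be inadequate, and a cycle-by-cycle scaling (or another phase estimator) would be needed.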

4.5 Theory and Theoretical Implications 

A further hint that the discontinuities created by rate scaling are at 
least consistent with a nonequilibrium phase transition interpretation can be 
gleaned from Stetson (1951, Figure 51). In this figure, whose main features 
are reproduced here (Figure 8a), distribution curves are presented showing the 
rates at which "doubling" and "singling" occur when a single syllable is 
repeated at varying rates. Although "doubling" occurs at rates between 1 and 
3.5 syllables per sec, a peak for doubling is present at 2.5 syllables/sec. 
Another way to envisage these data is to invert the curves: Two minima are 
then apparent, one for each of the two articulatory patterns, separated by a 
local hill or maximum (Figure 8b). As speaking rate is increased, it becomes 
increasingly difficult (as indicated by progressively fewer and fewer 
observations) for a subject to maintain "doubling." Then, at a critical 
point, a shift to the next "equilibrium" configuration occurs, corresponding 
to the "singling" pattern. Analogous to our discussion above, it seems 
plausible to suggest that: a) the doubling and singling patterns correspond 
to distinct system modes, each characterized by specific phasing relations 
among articulatory gestures; and b) the transition from doubling to singling 
beyond a critical production rate reflects a system bifurcation. 
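
The inversion of distribution curves into "potential functions" can be sketched as follows. The mapping V(x) = -ln P(x) is one conventional choice and is an assumption here, since the text says only that the curves are inverted. The density is a synthetic mixture of two Gaussians: the first peak is placed at 2.5 syllables/sec as reported for doubling, while the second peak position and both widths are hypothetical.

```python
import math


def gaussian(x, mu, sigma):
    """Normal probability density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def density(rate):
    """Synthetic two-peaked distribution over syllable rate (per second)."""
    return 0.5 * gaussian(rate, 2.5, 0.4) + 0.5 * gaussian(rate, 4.0, 0.4)


def potential(rate):
    """Assumed inversion: V = -ln P, so peaks of P become wells of V."""
    return -math.log(density(rate))


grid = [1.0 + 0.01 * i for i in range(401)]   # 1.0 to 5.0 syllables/sec
values = [potential(rate) for rate in grid]

# Local extrema found by comparing each grid point with its neighbors.
minima = [round(grid[i], 2) for i in range(1, len(grid) - 1)
          if values[i] < values[i - 1] and values[i] < values[i + 1]]
maxima = [round(grid[i], 2) for i in range(1, len(grid) - 1)
          if values[i] > values[i - 1] and values[i] > values[i + 1]]
print(minima, maxima)
```

Each well corresponds to one articulatory pattern and the hill between them to the sparsely observed region; on this reading, scaling speaking rate pushes the system up the wall of one well until the neighboring well is the only stable configuration left.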




[Figure 8 panels A and B. Horizontal axis: syllable frequency (s⁻¹).] 

Figure 8. A. Distribution curves (probability density functions) showing the 
rates at which "doubling" and "singling" occur when a single 
syllable is repeated at varying syllable frequencies (adapted from 
Stetson, 1951, Figure 51, p. 69). B. Corresponding "potential 
functions" for the probability distributions shown in A. See text 
for further details. 

Figure 9. Relationship between oxygen consumption and locomotory speed during 
the walk, trot, and gallop of ponies (adapted from Hoyt & Taylor, 
1981). See text for further details. 

These speech data of Stetson, therefore, bear a striking resemblance to our 
speech and hand movement data as well as recent work on locomotor gait 
transitions (see Figure 9, reproduced from Hoyt & Taylor, 1981),^10 an 
interpretation of which is given in Kelso and Tuller (1984a) and Kelso and 
Scholz (1985). In the case of quadruped gait, the modes correspond to 
particular phasing relations among the limbs, which, when the animal is 
allowed to locomote freely, correspond to regions of minimum oxygen 
consumption. Hoyt and Taylor (1981), however, forced ponies to locomote away 
from these stability regions by increasing the speed of a treadmill on which 
the ponies walked. That is, according to our interpretation, they 
experimentally displaced the ponies away from equilibrium. As locomotor 
velocity is scaled, it becomes metabolically costly for the animal to maintain 
a given interlimb configuration; a switch into the next stable region, that 
is, the next local minimum, occurs (e.g., walking shifts to trotting). As in 
the hand movement data, when a critical value is reached (a point at which the 
"forces" driving through the system, roughly equated with increases in neural 
activation of muscle groups induced by rate scaling, compete with, and 
overcome, the "forces" holding the system together, i.e., characteristic phase 
relations), the system bifurcates and a new (or different) spatiotemporal 
ordering emerges. We want to emphasize that such ordering changes are not 
strictly fixed for any of these situations. Horses, for example, can trot at 
speeds at which they normally gallop (as a visit to Yonkers' race track to 
observe the trotters will quickly reveal), but it is metabolically expensive 
to do so. Similarly, "doubling" is possible beyond the bifurcation point in 
Figure 8, as illustrated by the dashed line, but there are so few observations 
there (at rates between 3 and 4 syllables/sec) as to suggest that 
coordination in that region is highly unstable. 
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In summary, although there are obvious differences between the various 
critical phenomena discussed here, there is reason to suppose that all of 
them (hand movement, speech, and gait) correspond to instabilities that arise 
as the particular system is driven experimentally away from equilibrium.^11 
Obviously, much more work needs to be done to ground this conjecture (see 
Kelso & Scholz, 1985; Schöner et al., 1986, for possible experimental 
directions). In each case, new stabilities arise, indexed by particular phase 
relations between the components, as a result of competition between energy 
flowing into the operational components (i.e., a scaling influence) and the 
ability of those components to absorb the energy flow in their new 
configuration. In the hand movement case and, by hypothesis, in speech as 
well, we expect that higher bifurcations are possible because the system has 
available additional degrees of freedom (see Kelso & Scholz, 1985). That is, 
more configurations are possible, some of which will be stable and others 
not, precisely because of the availability of these extra degrees of freedom. 
In this view, the latter are not a curse (cf. Bellman, 1961) but a tremendous 
advantage. In addition, fluctuations (the "noise" often removed by engineers) 
can be shown to permit the discovery of new modes or phasing structures 
(cf. Schöner et al., 1986). 

The present theoretical perspective not only affords the potential for a 
principled analysis of pattern formation (for many more details, see Haken et 
al., 1985), but, as we mentioned earlier, the nature of the pattern change 
itself may also prove rather informative. The theory predicts that the states 
that emerge under scaling influences are the most favored ones, and empirical 
evidence supports this interpretation. Thus, within the range of driving 
frequencies examined in the experiments of Kelso and colleagues, shifts to the 
symmetrical mode of coordination occur, but not vice versa. Similarly, in 
speech, the articulatory configuration supporting the consonant-vowel form is 
the more fundamental: utterances never shift to the VC form under rate 
increases when the system is originally prepared in the "CV state." In each 
case, one structure can be fractured, but not the other. By the arguments and 
data discussed here, this is because certain phase relations among the 
articulators, which can be modeled as an order parameter for the total 
articulatory ensemble, are more stable than others. 

Clearly both CV and VC forms (like symmetric and antisymmetric hand 
movements) can be produced easily in a given region of parameter space. The 
question of which of the two patterns is more basic is answered by 
determining which remains beyond a critical point. The fact that the 
consonant-vowel form "wins out" when the system is scaled is thus a 
consequence of the stability of the articulatory configuration for that form. 
That is, certain configurations can absorb the energy input more efficiently 
than others. The universal tendency (Stetson, 1951) to simplify coordination 
by eliminating the arresting consonant (i.e., the one tied to the previous 
vowel) suggests that it is in some sense "easier" for the system to produce 
movements "in-phase" than otherwise. However, this does not have to be the 
case according to the present thesis. It remains very much an open 
question, to be pursued empirically, as to which phase relations are more 
stable than others. In the case of speech, unlike the hand movement case (at 
least in the most primitive, paradigmatic case studied in our experiments), we 
would expect a much larger and more varied (but perhaps nested) set of stable 
phasings. A similar hypothesis applies to studies of phasing in skilled 
pianists, which we are presently analyzing. In each case, the layout of the 
attractor states should be much more "wrinkled" than the "simple" bistable 
potential (differentiating in-phase and antiphase modes) that we have studied 
thus far. 
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In spite of the foregoing caveats, the present analysis may rationalize, in 
an elegant fashion, why the consonant-vowel is a core syllable type in all 
languages (cf. Abercrombie, 1967; Clements & Keyser, 1983). Such a rationale 
has been missing in much phonological theory, which starts off with the CV 
core unit as a basic assumption. Moreover, the developmental evidence 
reviewed by Locke (1983) reveals a strong tendency for consonant-initial (CV) 
forms to predominate in infant babbling. Rate scaling studies may reveal 
these primitive forms of coordination in the mature organism and therefore 
offer a window into the building blocks of language, a principled 
decomposition of which has been lacking. Like the particle accelerator that 
breaks atoms apart to reveal their secrets, so forcing the articulatory system 
to perform at unusual rates may reveal the primitive units of language, and, 
more important, their interactions with other units. Lest this image be 
interpreted as too mechanical or immutable, let us allay the reader's concern; 
nothing could be further from our intent. Just as the cheetah does not have 
to proceed through the locomotory gaits when it pursues its prey, so the 
system that realizes language does not have to traverse any fixed set 
of phase relations to reveal its intent. What this experimental program may 
reveal, and this theoretical framework rationalize, is a design that allows 
for, or rather exploits, the low energy switching among its articulatory 
configurations: in short, a design appropriate for intentional systems. 

5. Summary 

Presented here, in preliminary form, is a general theoretical framework 
that seeks to characterize the lawful regularities in articulatory pattern 
that occur when people speak. A fundamental construct of the framework is the 
coordinative structure, an ensemble of articulators that functions 
cooperatively as a single task-specific unit. Direct evidence for 
coordinative structures in speech is presented and a control scheme that 
realizes both the contextually-varying and invariant character of their 
operation is outlined. Importantly, the space-time behavior of a given 
articulatory gesture is viewed as the outcome of the system's dynamic 
parameterization, and the orchestration among gestures is captured in terms of 
intergestural phase information. Thus, both time and timing are deemed to be 
intrinsic consequences of the system's dynamical organization. The 
implications of this analysis for certain theoretical issues in coarticulation 
raised by Fowler (1980) receive a speculative, but empirically testable, 
treatment. Building on the existence of phase stabilities in speech and other 
biologically significant activities, we also offer an account of change in 
articulatory patterns that is based on the nonequilibrium phase transitions 
treated by the field of synergetics. Rate scaling studies in speech and 
bimanual activities are shown to be consistent with a synergetic 
interpretation and suggest a principled decomposition of languages. The CV 
syllable, for example, is observed to represent a stable articulatory 
configuration in space-time, a possible rationalization for the presence of 
the CV as a phonological form in all languages. The uniqueness of the present 
scheme is that stability and change of speech action patterns are seen as 
different manifestations of the same underlying dynamical principles: the 
phenomenon observed depends on which region of the parameter space the system 
occupies. Though probably wrong, ambitious, and the outcome of much idle 
speculation, the simplicity of the present scheme is attractive and may offer 
certain unifying themes for the traditionally disparate disciplines of 
linguistics, phonetics, and speech motor control. 
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Footnotes 

^1 Note that this claim, to be supported here, seems to run counter to the 
notion that normal phoneme rates "can be achieved only if separate parts of 
the articulatory machinery--muscles of the lips, tongue, velum, etc.--can be 
separately controlled" (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 
1967). 

^2 With implications, no doubt, for speech perception, a topic that is, 
however, beyond the mission of this paper (but see, e.g., Liberman & Mattingly, 
1985). 

^3 A preliminary report of these data was given by Kelso and Tuller (1985b). 

^4 Note that there is an important caveat for the phase notion. Though 
phase has been illustrated here at a very "simple" interarticulator level, we 
do not want to suggest that this is necessarily the appropriate frame of 
reference for speech production and perception. However, the point is that 
regardless of the particular frame of reference (e.g., events defined at 
muscle, articulator, tract variable levels, etc.), a concept such as phase 
will be crucial to specifying the sequence of events. 

^5 The computation of the phase angle of a point on the phase plane is 
problematical. Because the units of position and velocity are incommensurate 
(mm vs. mm/sec, for example), applying inverse trigonometric functions 
directly to the data yields meaningless results. To avoid this problem, we 
normalize both position and velocity to the same numerical interval, -1 to +1 
(not necessarily the unit circle for periodic data), and then apply the 
inverse tangent function to the normalized data. The normalization of 
position over a cycle of data proceeds via the following linear transform: 

$$P_{norm} = \frac{2P}{P_{max} - P_{min}} - \frac{P_{max} + P_{min}}{P_{max} - P_{min}}, \qquad (1)$$

where $P_{norm}$ is the normalized position, $P$ is the actual position, and 
$P_{max}$ and $P_{min}$ are the maximum and minimum position values over a 
cycle. 

This has the effect of (i) rescaling the data to a range of 2 units and (ii) 
shifting the equilibrium position to zero, i.e., the interval -1 to +1 is 
achieved. Velocity is normalized according to which half-cycle the point of 
interest is in, to put the half-cycle of articulator raising on the interval 0 
to 1 and the half-cycle of articulator lowering on the interval 0 to -1. That 
is, 

$$V_{norm} = \frac{V}{\mathrm{abs}(V_{max})}, \qquad (2)$$

where $V_{norm}$ is the normalized velocity, $V$ is the actual velocity, and 
$V_{max}$ is the maximum velocity during the corresponding half-cycle. 

The arctangent was then computed using the normalized position and velocity 
values, 

$$\text{phase angle} = \arctan(V_{norm} / P_{norm}). \qquad (3)$$

The final value obtained is a number from 0 to 360 degrees, which increases in 
value in the direction opposite to the unwinding of trajectories on the phase 
plane, as per mathematical convention. 
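
The normalization and phase computation in Equations 1-3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis code used for the data reported here: the function names are hypothetical, the half-cycle of a sample is assumed to be identified by the sign of its velocity, and a four-quadrant arctangent (`math.atan2`) is assumed as the means of obtaining values over the full 0 to 360 degree range, which a plain arctangent alone does not provide.

```python
import math

def normalize_position(positions):
    """Equation 1: map one cycle of position samples onto the interval -1 to +1."""
    p_max, p_min = max(positions), min(positions)
    span = p_max - p_min
    return [2.0 * p / span - (p_max + p_min) / span for p in positions]

def normalize_velocity(velocities):
    """Equation 2: divide each sample by the peak speed of its own half-cycle.

    Assumption: positive velocity marks the raising half-cycle and negative
    velocity the lowering half-cycle.
    """
    peak_up = max((v for v in velocities if v > 0), default=1.0)
    peak_down = min((v for v in velocities if v < 0), default=-1.0)
    return [v / abs(peak_up) if v > 0 else v / abs(peak_down) for v in velocities]

def phase_angles_deg(positions, velocities):
    """Equation 3 with a four-quadrant arctangent, wrapped to 0-360 degrees."""
    p_norm = normalize_position(positions)
    v_norm = normalize_velocity(velocities)
    return [math.degrees(math.atan2(v, p)) % 360.0 for p, v in zip(p_norm, v_norm)]
```

Because the angle is measured counterclockwise while trajectories unwind clockwise on the phase plane, the angle returned for an ideal sinusoidal cycle decreases as time advances.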
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key feature in the development of science has been to define limits or 
constraints on natural phenomena. Once such constraints are known, much new 
understanding results (Prigogine & Stengers, 1984). Phase has this 
constraint-like property. Our experiments described in Section 4 reveal the 
limits over which one organization (a given phase relation) can remain stable. 
Also, because in those experiments, it is phase (and phase alone, as far as we 
know) that changes dramatically, we have some reason to suppose that phase is 
a key parameter even in the stable range of performance, that is, that phase 
represents a fundamental constraint (see Section 3). 

^7 The beat stroke as defined by Stetson (1951), is "always ballistic ... 
and can hardly be longer than ms" (p. 29). He continues "The unit 
movement of speech is the pulse which produces the syllable, a pulse of air 
through the glottis made audible by the vocal folds in speaking aloud and 
stopped and started by the chest muscles or by the auxiliary movements of the 
consonants" (p. 30). And later on, in a discussion of consonant release, 
Stetson indicates that "... the stroke of the expiratory chest muscles and 
the beat stroke of the consonant occur at the same time" (p. 46). We include 
this definition and clarification of "beat stroke" for mostly historical 
reasons. Some of Stetson's claims about syllable pulses have been seriously 
questioned (Ladefoged, Draper, & Whitteridge, 1958). The present analysis, of 
course, does not rely on such notions. 

^8 Lip aperture was estimated simply by subtracting the position of the 
lower lip from that of the upper lip. Note, however, that the same data 
pattern was obtained when the movement of a single labial articulator (e.g., 
the lower lip) was compared to glottal aperture. 

^9 Stetson (1951) says little about the instructions given to subjects when 
speaking rate is reduced. Obviously, with slowing they will be able to say 
/ip/ below a certain rate. The question is when they would do so spontaneously 
if, for example, they could not hear the consequences of their production. 

^10 This resemblance is not only qualitative but perhaps quantitative as 
well. It may be pure serendipity that the ratio of the critical frequency 
(~3.1 syllables/sec) to the "doubling" mode frequency (~2.5 syllables/sec), 
shown in Figure 8, bears a close correspondence (~1.24) to the dimensionless 
ratios computed for Kelso's bimanual (~1.31) and Hoyt and Taylor's gait 
(~1.33) data (see Kelso, 1984). These dimensionless numbers, analogous to 
Reynolds numbers in fluid dynamics, may be a reflection of the system's 
intrinsic "distance from equilibrium." That is, they may index how far beyond 
a "preferred" steady state a given pattern can persist before it fractures 
into a new configuration. 

^11 One difference, for example, between the hand and speech data and the 
gait analysis is that, in the former, the various modal patterns can coexist 
in stable forms at subcritical rates. Galloping, on the other hand, is not 
observed at slow walking speeds. Though it may be available, it is simply not 
a stable locomotor mode in that region of the parameter space (see Figure 9). 
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THE VELOTRACE: A DEVICE FOR MONITORING VELAR POSITION* 
Satoshi Horiguchi† and Fredericka Bell-Berti††



Abstract. This paper describes the Velotrace, a mechanical device 
designed to allow the collection of analog data on velar position. 
The device consists of two levers connected through a push-rod and 
carried on a pair of thin support rods. The device is positioned in 
the nasal passage with the internal lever resting on the nasal 
surface of the velum and the external lever positioned outside the 
nose. The movements of the external lever reflect the movement of 
the internal lever as it follows velar movement and are recorded as 
an analog signal using an optoelectronic position-sensing system. 
Results of evaluation studies indicate that the Velotrace accurately 
reflects the relatively rapid movements of the velum during speech. 

Introduction 

Since the size of the velar port determines the oral or nasal nature of 
speech sounds, there has long been interest in studying the velopharyngeal 
region (see Fritzell, 1969, for an extensive historical review). The various 
techniques used to study the velopharyngeal mechanism have examined a number 
of its dimensions, one result of which is the recognition that the size of the 
open velar port is reflected in the position of the velum, although velar 
position may also vary when the port is completely closed (see, for example, 
Henderson, 1984; Moll & Daniloff, 1971). These adjustments of velar position 
above the level at which closure occurs result from the anatomical 
relationship between the velum and the levator veli palatini (LVP) muscle. 
That is, since the superior attachment of the LVP muscle lies well above the 
level at which port closure is complete, increasing contraction of that muscle 
will continue to raise the velum even after the velopharyngeal port has been 
closed. As a result, changes in the vertical position of the velum throughout 
its range of movement may be considered to reflect speech motor control of the 
velum, and have the additional benefit of not suffering from a boundary effect 
in the way that velar port size measures do when port closure is achieved 
(see, for example, Bell-Berti, 1980). Thus, monitoring changes in the 
vertical position of the velum should allow the discovery of the principles of 
(normal) velar motor control, which should increase our understanding of 
speech production in general, and also increase our ability to evaluate velar 



*Versions of the paper were presented at the 1984 meeting of the American 
Cleft Palate Association (Seattle, WA, May 1984) and of the American 
Speech-Language-Hearing Association (San Francisco, CA, November 1984). 
†Also Research Institute of Logopedics and Phoniatrics, Faculty of Medicine, 
University of Tokyo, Tokyo, Japan. 
††Also St. John's University. 
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control problems in some clinical populations. Thus, and continuing in the 
tradition of mid-sagittal monitoring of velar function, we have developed a 
new mechanical device, the Velotrace, that allows the collection of data on 
velar position in analog form and eliminates the need for X-ray exposure and 
for frame-by-frame measurements of cine and video recordings. 

The Device 

The Velotrace (Figure 1) has three major parts: an internal lever, an 
external lever, and a push-rod between them. The push-rod and levers are 
carried on a pair of thin support rods. The levers are so connected to the 
push-rod that when the internal lever is raised, the external lever is 
deflected toward the subject. The device is loaded with a small spring that 
improves its frequency response and thus improves the ability of the internal 
lever to follow rapid downward movements of the velum. The effective length 
of the internal lever is 30mm (i.e., the linear distance between the fulcrum 
and tip), that of the external lever is 60mm, and that of the push-rod 
assembly is 150mm. The height of the device is 4mm and its width is 3mm, 
making it no larger than many commonly used nasopharyngeal fiberoptic 
endoscopes. 




Figure 1. Schematic diagram of the Velotrace. 

The device is positioned after topical anesthetic and decongestants have 
been applied to the nasal mucosa, if necessary, and the posterior pharyngeal 
wall has become visible through the nasal passage; the Velotrace is inserted 
using a procedure similar to that used for nasal catheterization. Although 
the Velotrace is a rigid device (unlike fiberoptic endoscopes), the insertion 
is easy unless the subject has serious pathologies or deformities in the nasal 
passage (e.g., substantial deviation of the nasal septum, nasal polyps, etc.). 

None of our four subjects (three males and one female) for the evaluation 
study has complained of any discomfort from the device. 

The fulcrum of the internal lever of the Velotrace is positioned at the 
end of the hard palate, with the internal lever resting on the velum and the 
support rods resting on the floor of the nasal cavity (Figure 2). An external 
clamp, which is attached to a headband positioned on the subject's head, is 
used to stabilize the position of the Velotrace against his/her head during 
recording sessions. 







Figure 2. A mid-sagittal schematic drawing of the Velotrace in position with 
the internal lever resting on the velum. 

The Recording System 

Monitoring the movements of the external lever can be accomplished in a 
number of different ways. For example, one might use a velocity-displacement 
transducer, which would make the Velotrace a convenient stand-alone device for 
the clinical evaluation of velar movement. Another approach would be to use 
an optoelectronic tracking system, such as the one we have been using, to 
monitor the movements of the external lever using infrared Light Emitting 
Diodes (LEDs) attached to the Velotrace. In our system, one LED is attached 
to the end of the external lever and allows us to monitor the movement of the 
lever about its fulcrum. A second LED is positioned at the fulcrum of the 
external lever and serves as a reference point against which the movements of 
the end of the external lever can be described. The positions of the LEDs are 
tracked in two-dimensional space. The acoustic speech signal and a timing 
signal are recorded simultaneously with the LED-position signals on a 
multi-channel instrumentation data recorder. The position signals may also be 
monitored with an oscilloscope in real time. The data acquisition system is 
represented in Figure 3. 
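
One way the two LED position signals might be combined is sketched below. The text does not specify how the reference LED is used numerically, so this is an assumption made for illustration only: the tip position is taken relative to the fulcrum LED and expressed as a lever angle, which makes the measure insensitive to any common translation of the device in the tracking plane. The function name and the choice of degrees are hypothetical.

```python
import math

def external_lever_angle_deg(tip_xy, fulcrum_xy):
    """Angle of the external lever about its fulcrum, in degrees.

    tip_xy and fulcrum_xy are (x, y) positions of the two LEDs in the
    two-dimensional tracking plane. Subtracting the fulcrum position
    removes any displacement common to both LEDs.
    """
    dx = tip_xy[0] - fulcrum_xy[0]
    dy = tip_xy[1] - fulcrum_xy[1]
    return math.degrees(math.atan2(dy, dx))
```

Shifting both LED positions by the same amount leaves the computed angle unchanged, which is the property the reference LED is meant to provide.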



Evaluation Studies 

The experimental utterance set was composed of three groups of 
disyllables (Table 1). The eight items of Utterance Group 1 each contained a 
medial oral-nasal consonant contrast that was used to ensure that maximum 
stress was placed on the velar lowering mechanism (because the nasal consonant 
immediately follows a very strongly oral articulation). These utterances 
allowed us to examine the ability of the Velotrace to follow very rapid 
downward movements of the velum. Conversely, the eight items of Utterance 
Group 2 each contained a medial nasal-oral consonant contrast, used to ensure 
that maximum stress was placed on the velar raising mechanism (because a very 
strongly oral articulation immediately follows a nasal one). These utterances 
allowed us to examine the ability of the Velotrace to follow very rapid upward 
movements of the velum. The six items of Utterance Group 3 contained high and 
low vowel contrasts with medial oral consonant sequences of varying length, 
and allowed us to examine the ability of the Velotrace to reflect the smaller 
velar excursions of entirely oral speech. All of the utterances used also had 
the advantage of having been used in some of our previous work, thus providing 
the opportunity of comparing the Velotrace data, albeit for different 
subjects, with endoscopically recorded data. 

In the first evaluation study, Velotrace data were compared with 
previously collected endoscopic data. The endoscopic data used for comparison 
with the Velotrace data were obtained from two experiments in which 
frame-by-frame measurements of velar position were made of cine films 
photographed through a nasally positioned fiberoptic endoscope (Bell-Berti, 
1980; Bell-Berti, Baer, Harris, & Niimi, 1979). The subject for the 
endoscopic studies was a speaker of educated Greater Metropolitan New York 
City English. In those experiments a long thin plastic strip with grid 
markings was inserted into the subject's nostril and placed along the floor of 
the nose and over the nasal surface of the velum, to enhance the contrast 
between the edge of the supravelar surface and the posterior pharyngeal wall. 
Then a flexible fiberoptic endoscope was inserted into the subject's nostril, 
and positioned so that it rested on the floor of the nasal cavity with its 
objective lens at the posterior border of the hard palate, providing a view of 
the velum and lateral pharyngeal walls from the level of the hard palate to 
above the maximum elevation of the velum (observed during blowing). Cine 
films were taken through the endoscope at 60 frames/sec. The position of the 
high point of the velum was then tracked, frame-by-frame, with the aid of a 
small laboratory computer. 

The subject for the first Velotrace experiment was a normal speaker of 
educated Middle Atlantic American English who produced between 7 and 12 
repetitions of each of the 22 experimental utterances. The 16 disyllables of 
Groups 1 and 2 were produced in isolation. The six disyllables of Group 3 
were produced in the carrier phrase "It's a (test word) again." Within each 
group, the utterances were read from randomized lists. The speech acoustic 
signal and the positions of the LEDs were recorded simultaneously using the 
system described above. The speech acoustic signal was subsequently digitized 
at 10,000 samples/sec and the Velotrace signals were digitized at 200 
samples/sec. 

Table 1 

Experimental Utterance List 

Utterance Group 1        Utterance Group 2        Utterance Group 3 
(oral-nasal contrast)    (nasal-oral contrast)    (vowel contrast) 

/fipmip/                 /fimpip/                 /flisap/ 
/fapmap/                 /fampap/                 /flitsap/ 
/fibmip/                 /fimbip/                 /fliststap/ 
/fabmap/                 /fambap/                 /kasiz/ 
/fismip/                 /fimsip/                 /katsiz/ 
/fasmap/                 /famsap/                 /kaststiz/ 
/fizmip/                 /fimzip/ 
/fazmap/                 /famzap/ 

An acoustic event identified in the waveform of each token of each 
utterance type served as a reference point for that token in subsequent data 
analysis. The choice of acoustic reference point depended upon the phonetic 
structure of the utterance. These reference points have two 
functions: First, they allow us to examine the physiological signals for 
repetitions of an utterance type with reference to the same acoustic event. 
Second, they provide a reference point for aligning tokens of an utterance for 
calculating an ensemble average of the signals for the repetitions of an 
utterance type. 
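
The alignment and averaging step can be illustrated with a short Python sketch. It is a minimal example under stated assumptions rather than the laboratory's own averaging program: each token is assumed to be a list of equally spaced position samples, the acoustic reference point is assumed to be given as a sample index within each token, and the names `ensemble_average`, `pre`, and `post` are hypothetical.

```python
import math

def ensemble_average(tokens, ref_indices, pre, post):
    """Average tokens sample by sample after lining them up on their reference points.

    tokens      -- list of sample lists, one per repetition of an utterance type
    ref_indices -- index of the acoustic reference point within each token
    pre, post   -- number of samples kept before and after the reference point
    Returns (mean, sd), each of length pre + post + 1.
    """
    windows = []
    for samples, ref in zip(tokens, ref_indices):
        if ref - pre < 0 or ref + post >= len(samples):
            raise ValueError("token too short for the requested window")
        windows.append(samples[ref - pre:ref + post + 1])
    n = len(windows)
    columns = list(zip(*windows))
    mean = [sum(col) / n for col in columns]
    if n > 1:
        # Sample standard deviation across tokens at each aligned time point.
        sd = [math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
              for col, m in zip(columns, mean)]
    else:
        sd = [0.0] * len(mean)
    return mean, sd
```

In the returned lists, index `pre` corresponds to the reference point, that is, to zero on the abscissa of an ensemble-average plot.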

The endoscopically collected velar position data had been reduced to 
ensemble averages and their standard deviations (Bell-Berti, 1980; Bell-Berti 
et al., 1979). Since the individual token data were no longer available, it 
was necessary to calculate the equivalent ensemble averages for the Velotrace 
data. However, before comparing ensemble averages of the Velotrace data with 
the endoscopic data, we examined the Velotrace token data and averaged data. 
Samples of both ensemble-average and individual token Velotrace data are shown 
in Figure 4. The velar movement patterns recorded with the Velotrace are very 
similar for the tokens of each utterance type, and the individual tokens are 
also strikingly similar to the ensemble averages. Thus, we conclude that the 
ensemble averages are representative of the constituent token data, and may be 
used for comparison of Velotrace data with existing ensemble-average 
endoscopic data. 




Figure 4. Ensemble-average (above) and individual token (below) Velotrace 
data for one utterance type. Zero on the abscissa identifies the 
reference point for aligning tokens of an utterance type for 
computer sampling and averaging. 
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Figure 5 displays ensemble averages of two different test words, (one 
each from Groups 1 and 2) recorded with an endoscope for one subject and with 
the Velotrace for the other subject. It is clear that the ensemble-averaged 
Velotrace data display the same patterns as do the frame-by-frame measurement 
data obtained from the cine films, although the subject, speech rate, and 
duration of the individual speech sounds are different. We also observe 
strikingly similar patterns for endoscopic and Velotrace data in which the 
test words were embedded in a carrier phrase (Figure 6). (See Bell-Berti, 
1980, for a description of the experimental design and results of the second 
endoscopic study.) 

For the second evaluation study, cine-radiographic films were taken of a 
third subject, also a speaker of educated Greater Metropolitan New York City 
English. The experimental utterances were two tokens each of a subset of six 
of the utterances used by Bell-Berti et al. (1979) and in the first evaluation 
study. For this experiment, the Velotrace was positioned in the subject's 
nasal cavity, with the internal lever resting on the nasal surface of the 
velum. A thin gold chain was inserted through the other nasal passage and 
positioned along the velum and into the oropharynx to improve visualization of 
the nasal surface of the velum in the X-ray images.^ The films were taken at 
60 frames/sec. 

Figure 7 represents the film frame image, with the measurement points 
indicated with numbers on the figure. We measured the position of the tip of 
the internal lever of the Velotrace (1), the point on the velum that would be 
tracked by the Velotrace (2), the Velotrace internal fulcrum (3), and two 
reference points: an upper molar (4) and a lead pellet on the upper incisor 
(5). Visual inspection of the data on the vertical position of the Velotrace 
lever and of the velum (see Figure 8) suggests that movements of the Velotrace 
clearly reflect the movements of the velum itself. In order to quantify the 
relationship between these measures, we calculated the correlation coefficient 
between our measures for each of the twelve tokens. The very high linear 
correlation between these two measurements is reflected in scatterplots of the 
data (e.g., Figure 9) and in correlation coefficients of between 0.982 and 
0.995. 
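
For readers who wish to reproduce this kind of comparison on their own frame-by-frame measurements, the linear correlation for one token can be computed as in the sketch below. The Pearson product-moment coefficient is assumed here, since the text speaks of linear correlation without naming the statistic; the function name and the two input lists (vertical velar position and vertical Velotrace tip position, one value per film frame) are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 samples")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

A coefficient near 1 for a token indicates that the lever tip and the velum rise and fall together in a nearly linear relation, even if the two measures differ by a constant offset or scale factor.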

We also compared velocity measures derived from these Velotrace movement 
data with equivalent velocity measures reported in the literature. To do 
this, we calculated the velocity of the vertical component of the velar and 
Velotrace movements. The velocity functions were calculated from successive 
central difference scores for each sample point. Maximum upward velocity, 
occurring in the transition between a nasal and a fricative consonant 
(/fimzip/), may be as high as 130mm/sec; maximum downward velocity, occurring 
in the transition between fricative and nasal consonants (/fismip/), may be as 
high as 100mm/sec. These values are similar, but not identical, to Kuehn's 
(1976) data, in which he reports downward velocities as high as 132mm/sec.^ 
We also calculated the linear correlation between the velar and Velotrace 
velocity functions for each token; all are within the range of r=0.90 to 
r=0.97. On the basis of visual inspection of the functions, the very high 
positive linear correlations between the two positional measures for each of 
the 12 tokens in our experimental set, and the equally high correlation 
between the velocities of these movements, we conclude that the internal lever 
accurately follows movements of the velum. 
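
The velocity calculation can be sketched as follows. This is an illustration only: "central difference" is taken here in its usual sense, the difference between the samples on either side of a point divided by twice the sampling interval, which is an assumption about the exact formula used. The function name is hypothetical, and the 60 frames/sec rate in the usage note is the film rate stated above.

```python
def central_difference_velocity(position_mm, frame_rate_hz):
    """Vertical velocity (mm/sec) at each interior sample point.

    position_mm   -- vertical positions in mm, one per film frame
    frame_rate_hz -- frames per second of the film
    The first and last frames have no two-sided neighbors and are omitted,
    so the result is two samples shorter than the input.
    """
    dt = 1.0 / frame_rate_hz
    return [(position_mm[i + 1] - position_mm[i - 1]) / (2.0 * dt)
            for i in range(1, len(position_mm) - 1)]
```

At 60 frames/sec, a steady displacement of 1 mm per frame corresponds to 60 mm/sec, which gives a sense of scale for the peak velocities reported above.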
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/iizmlp/ ^ /fimzip/ 
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Figure 5. Ensemble-average endoscopic data from one subject (above) and 
Velotrace data from a second subject (below) for two utterance 
types (produced in isolation). Zero on the abscissa identifies the 
reference point for aligning tokens of an utterance type for 
computer sampling and averaging. 
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Figure 6. Ensemble-average endoscopic data from one subject (above) and 
Velotrace data from a second subject (below) for one utterance type 
(produced in a carrier phrase). Zero on the abscissa identifies 
the reference point for aligning tokens of an utterance type for 
computer sampling and averaging. 

Figure 7. Schematic drawing of lateral cine X-ray film frame image with the 
Velotrace in place and measurement points indicated: (1) tip of 
the Velotrace internal lever; (2) the point on the velum that would 
be tracked by the Velotrace; (3) Velotrace internal lever fulcrum; 
(4) upper molar reference point; (5) upper incisor reference point. 



/fimzip/ 




Figure 8. Comparison between the actual vertical movement of the velum and 
the movement of the tip of the internal lever of the Velotrace. 
The time function for one token is shown in the upper panel, with 
the acoustic waveform at the top and the Velotrace tip elevation 
(solid line) and velar elevation (dotted line) data below. A 
scatterplot of velar elevation and Velotrace tip elevation for each 
sample point in one token is shown in the lower panel. 
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Figure 9. Scatterplots of vertical velar position vs. Velotrace position data 
for one token of each utterance type. 

Conclusion 

In evaluating the Velotrace, we compared data collected with the 
Velotrace with data obtained from frame-by-frame measurements of cine films 
(both endoscopic and radiographic) and found that the Velotrace signal 
accurately reflects the relatively rapid movements of the velum during speech. 
Among the advantages of the Velotrace as a device for monitoring the velum 
during speech are the elimination of X-ray exposure for the subject, and of 
frame-by-frame measurement for the investigator. Furthermore, the elimination 
of both of these drawbacks makes possible the collection and analysis of 
substantial quantities of data that should allow the development of a more 
complete understanding of velar motor control. In addition, the analog 
Velotrace signal can be sampled at a sufficiently high frequency to allow 
calculation of the highly accurate velocity and acceleration functions of 
velar movement patterns. 

The Velotrace has a number of potential applications, among them the 
study of velar kinematics in normal speakers. Additionally, it may be used to 
monitor velar function in a number of different speech pathologies. For 
example, by carrying out clinical studies of individuals suffering from 
neuromuscular pathologies that affect velar function during speech and 
swallowing, one should be able to provide objective descriptions of the nature 
of the disruptions of speech and swallowing, although the Velotrace does not 
provide information about lateral pharyngeal wall movement. Such information 
should also provide further insight into the nature of the organizational 
patterns of velar motor control. Another application would be to the study of 
velar movement patterns in persons with velopharyngeal insufficiency, to 
examine the ways in which vertical movements of the velum differ from, or are 
similar to, those of normal speakers: That is, do they use "normal" or nearly 
normal articulatory strategies that fail because of anatomical and/or 
physiological limitations? Similar studies could be conducted with persons 
having mobile repaired clefts, to identify their articulatory strategies. 
Finally, the Velotrace may serve as a biofeedback device for training 
individuals with a variety of velar function problems, including pre-lingual 
hearing impairment, as well as the disorders mentioned above. We would note, 
however, that extending the use of the Velotrace to studies of children's 
speech depends upon considerations of instrumental and anatomical size, as 
well as interference of the adenoids with the function of the internal lever. 
Furthermore, the use of this device to study velopharyngeal function in 
persons with anatomical anomalies may require modification of the device. 

Footnotes 

^For Group 1 utterances, the beginning of [m] was chosen as the reference 
point (voicing onset following [p]; voicing onset or end of frication 
following [s]; increased amplitude following [b]; end of frication following 
[z]). For Group 2 utterances, the end of [m] was chosen as the acoustic 
reference point (voicing offset before [p]; voicing offset or beginning of 
frication before [s]; amplitude reduction or voicing offset before [b]; 
beginning of frication before [z]). For Group 3 utterances, the end of the 
medial consonant-sequence occlusion was chosen as the acoustic reference 
point (end of frication for [...sV] and the stop burst for [...tV] 
utterances). 

^As a result of field-size limitations and because we were primarily 
interested in knowing how well the internal lever follows movements of the 
velum (rather than how well the external lever reflects movements of the 
internal lever), the external lever was not included in the viewing field. 

^Kuehn's displacement-versus-time data, taken from the constant velocity 
portion of displacement-versus-time curves, are not directly comparable with 
our data for two reasons. First, his data are measures of Euclidean 
distance, whereas ours are of vertical distance only. Second, his data are 
an index of the velocity during the relatively constant velocity portion of 
the gesture, whereas ours are the peak values in the first derivatives of our 
displacement-versus-time functions. However, using the angular factor that 
he reported, we have estimated the maximum y-trajectory velocities for each 
of his two subjects. The maximum y-trajectory upward velocities are 54mm/sec 
and 90mm/sec; downward velocities are 38mm/sec and 83mm/sec. 

TOWARDS AN ARTICULATORY PHONOLOGY* 

Catherine P. Browman and Louis M. Goldstein† 

Abstract. We propose an approach to phonological representation 
based on describing an utterance as an organized pattern of 
overlapping articulatory gestures. Because movement is inherent in 
our definition of gestures, these gestural "constellations" can 
account for both spatial and temporal properties of speech in a 
relatively simple way. At the same time, taken as phonological 
representations, such gestural analyses offer many of the same 
advantages provided by recent nonlinear phonological theories, and 
we give examples of how gestural analyses simplify the description 
of such "complex segments" as /s/-stop clusters and prenasalized 
stops. Thus, gestural structures can be seen as providing a 
principled link between phonological and physical description. 

1. Introduction 

The gap between the linguistic and physical structure of speech has 
always been difficult for phonological theory to bridge. Until recently, 
theories have encapsulated the linguistically-relevant structure of speech in 
a sequence of segmental units, each of which corresponds to a feature bundle. 
Under this strict segmental hypothesis (formulated in terms of features), the 
sequence of feature bundles that constitute segments forms a feature matrix, 
whose cells are organized into non-overlapping columns. Linguistically 
relevant contrast between utterances, in this approach, requires that at least 
one feature value differ between contrasting strings. The bridge to the 
continuous nature of speech is made by assuming that "each segment is 
characterized in terms of a state of the vocal organs, and the transitions 
between these states are ... predictable in terms of very general linguistic 
and physiological laws" (Anderson, 1974, p. 5). 
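
The strict segmental hypothesis can be illustrated with a small sketch. It is a toy rendering only: the three features, their values, and the function names are assumptions chosen for the example, not a claim about any particular feature system.

```python
# Illustrative feature bundles; the feature names and values are assumptions.
B = {"voice": "+", "continuant": "-", "nasal": "-"}
P = {"voice": "-", "continuant": "-", "nasal": "-"}
M = {"voice": "+", "continuant": "-", "nasal": "+"}

def feature_matrix(segments):
    """Arrange a string of segments as a matrix: one row per feature,
    one non-overlapping column per segment."""
    features = sorted(segments[0])
    return {f: [seg[f] for seg in segments] for f in features}

def contrasts(string_a, string_b):
    """True if the two strings differ in at least one feature value."""
    matrix_a = feature_matrix(string_a)
    matrix_b = feature_matrix(string_b)
    return any(matrix_a[f] != matrix_b[f] for f in matrix_a)

differs = contrasts([B, M], [P, M])   # the strings differ only in [voice]
same = contrasts([B, M], [B, M])      # identical strings do not contrast
```

Because every feature value lives in exactly one column, nothing in this representation can express a feature specification that spans, or only partly overlaps, neighboring segments.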

This strictly linear view of the relation between linguistic units and 
speech has come under attack in recent years from two different directions. 
Phonologists have found the constraint imposed by linear sequences of 
non-overlapping segments to be too extreme to capture a variety of 
phonological facts. Recognition of the importance of allowing feature 
specifications to overlap was made, e.g., by Anderson (1974). He presented an 
alternative approach that decomposed articulation into four subsystems (an 
energy source, a laryngeal system, an oral system, and a nasal system), and 
noted that "it is possible... [that] the boundaries of specification in one 
system will not coincide with the boundaries of a specification in another" 
(1974, p. 274). The other direction of attack has come from phoneticians 
(e.g., Lisker, 1974), who have shown the linguistic relevance of the detailed 
temporal structure of speech. For example, as discussed below, 
interarticulator temporal organization may vary from language to language in a 
way that cannot be predicted (by any universal principles) from existing 
phonetic feature characterizations, and thus, must be specified somehow in 
language descriptions. These developments suggest a need for a revised 
conception of phonological/phonetic structure, one that incorporates 
overlapping phonological units and one that allows temporal relations among 
articulatory structures to emerge from the description. We consider the two 
lines of attack in greater detail. 

*In press, Phonology Yearbook (Vol. 3, 1986). 

The linearity assumption has been challenged (if not completely 
discarded) by attempts over the last ten years to formalize more enriched 
conceptions of phonological structure. Like Anderson's (1974) proposal, these 
efforts were undertaken in response to the failure of the segmental model to 
account adequately for certain facts. These conceptions include explicit 
incorporation of syllable structure (Hooper, 1972, 1976; Kahn, 1976), 
hierarchical metrical structures (Hayes, 1981; Liberman & Prince, 1977; 
Selkirk, 1980), dependency structures (Anderson & Jones, 1974; Ewen, 1982), 
independent structural or autosegmental tiers (Clements, 1980; Goldsmith, 
1976), and explicit incorporation of a consonant-vowel skeleton (Clements & 
Keyser, 1983; Halle & Vergnaud, 1980; McCarthy, 1981, 1984; Prince, 1984). 
While these approaches have increased the range of facts that can be 
adequately formalized in phonological theory, they are inexplicit with respect 
to the relation between the revised conception of phonological structure and 
the physical structure of speech. The traditional link between phonological 
and physical structure has vanished along with strictly linear segmental 
analyses, and a new link has yet to be forged. It is this task we attempt in 
this paper, by accounting as simply as possible for the organization of speech 
in both space and time. We will show that the structures that emerge from 
such an account can also be used as a basis for phonological 
description, indeed a kind of phonological description that is much in the 
spirit of the above-mentioned theories. 

From the phonetic side, there has been growing evidence that systematic 
phonetic feature representations cannot adequately describe phonetic 
differences among languages. Ladefoged (1980), for example, argues that the 
specification of features at the systematic phonetic level is neither 
"necessary nor sufficient to specify what it is that makes English sound like 
English rather than German" (1980, p. 495). As both Ladefoged and Anderson 
(1974) point out, phonetic differences between languages may involve aspects 
of speech that do not serve as the basis for phonological contrast within any 
one language. For example, Anderson shows that languages differ with respect 
to whether stops are released in clusters and in word final position, even 
though no single language contrasts released vs. unreleased stops. Thus, he 
suggests a feature [±release] to differentiate the phonetic representations in 
these languages. 

The difference between released and unreleased stops can be seen as part 
of a more general problem: differences among languages in the relative timing 
of articulatory gestures. The "unreleased" initial stops in clusters are, 
presumably, released, but only after the occlusion for the second stop has 
formed. There would be little acoustic evidence, therefore, of their release 
(cf. Catford, 1977). Thus, language differences in stop release may be 
analyzed as differences in the temporal overlap of adjacent closure gestures. 
It is possible, in general, to describe such cross-language differences in 
gestural timing within the SPE framework (Chomsky & Halle, 1968) by means of 
features such as [±release]. However, the potential number and variety of 
such differences would lead to the proliferation of features that have no 
contrastive function within languages. (A similar point about proliferation 
of phonetic features, but not specifically about timing relations, is made by 
Keating, 1984.) 
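
The overlap analysis of stop release can be made concrete with a short sketch. The interval representation, the function names, and the millisecond values below are hypothetical; they show only how one timing parameter could stand in for a [±release] feature.

```python
def overlap(interval_a, interval_b):
    """Duration shared by two (onset, release) intervals; 0.0 if none."""
    start = max(interval_a[0], interval_b[0])
    end = min(interval_a[1], interval_b[1])
    return max(0.0, end - start)

def first_release_is_audible(first_closure, second_closure):
    """The first stop's release leaves acoustic evidence only if it occurs
    before the occlusion for the second stop has formed."""
    return first_closure[1] <= second_closure[0]

# Hypothetical closure intervals (ms) for a two-stop cluster.
overlapped = ((0.0, 80.0), (60.0, 140.0))   # second closure forms early
sequential = ((0.0, 80.0), (90.0, 170.0))   # second closure forms late

overlap_a = overlap(*overlapped)
overlap_b = overlap(*sequential)
audible_a = first_release_is_audible(*overlapped)
audible_b = first_release_is_audible(*sequential)
```

In the first pattern the closures overlap and the first release is masked; in the second they do not overlap and the release is audible.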

It is not difficult to find documented examples of cross-language 
differences in gestural timing. A number of writers (e.g., Flege & Port, 
1981; Keating, 1985; Mitleb, 1984; Port, 1981) have demonstrated such 
differences in voicing contrasts, specifically in the duration of vowels 
preceding voiced and voiceless stops. While the acoustic duration of a 
preconsonantal vowel is generally longer before a voiced than before a 
voiceless stop, the effect is larger in some languages than in others (as was 
earlier noted by Lehiste, 1970), and can be virtually absent (e.g., in Polish, 
Czech, and Arabic). These differences in vowel duration presumably reflect 
differences in the relative timing of vowel and consonant gestures. In this 
case, a different feature, probably [±long], would be used in an SPE treatment 
to describe cross-language differences in gestural timing. 

Such phenomena are not restricted to voicing. For example, languages may 
also show differences in ejective consonants in the time between release of 
glottal closure and release of oral closure (Catford, 1977, p. 69; Lindau, 
1984). There is no evidence that such differences in ejectives contrast in 
any language, and therefore, some ad hoc feature, similar to [±release], would 
have to be proposed to account for this difference. Finally, Fourakis (1980) 
has shown that the occurrence of so-called epenthetic stops in English words 
like <tense> is dialect-dependent. If such "stops" are to be analyzed in 
terms of variation in the relative timing of oral and velic gestures (rather 
than actual segment insertion, cf. Anderson, 1976; Ohala, 1974), then such 
timing relations are also not universal, but are a property of a particular 
language or dialect. 

In general, then, languages can differ from one another in the timing of 
(roughly the same) articulatory gestures. The above examples are meant simply 
to illustrate the variety of phenomena that can be analyzed in this way. An 
SPE characterization of such examples not only proliferates features in the 
grammar; more seriously, it misses a generalization: timing of articulatory 
gestures is linguistically relevant, at least in terms of how languages are 
distinguished from one another. 

One approach to the description of language-particular patterns of 
articulatory timing has involved positing rules that specifically convert 
segmental feature matrices to temporally continuous physical parameters (e.g., 
Keating, 1985; Port, 1981). As described by Port and O'Dell (1984), such 
"implementation" rules are like phonological rules in that they must be 
assumed to differ from language to language, but unlike such rules, they do 
not map feature values onto feature values. Rather they take features (or 
matrices thereof) as input and output "a complex pattern of graded commands 
distributed over time to the various articulators" (Port & O'Dell, 1984, 
p. 122). However, such rules for implementing patterns of articulatory timing 
have not been made explicit. This is not surprising, perhaps, since the 
supposed output of the rules (control parameters for articulators) requires 
reference to the organization of articulatory movements. Yet no linguistic 
approach has provided a vocabulary for describing such organization. In the 
implementation rule view, the organization remains outside of speech itself, 
in the segmental (and metrical, etc.) structure. Speech itself has no 
organization, but is rather seen as a plastic medium that somehow serves to 
code the information present externally in the linguistic structure. It is 
worth noting that the implementation rules that have been successfully made 
explicit by Liberman and Pierrehumbert (1984) characterize intonation. Here, 
the relevant physical parameter is univariate in the acoustic domain (F0), and 
no assumptions about articulatory organization are made. 

Rather than positing inexplicit implementation rules or proliferating ad 
hoc features, we propose to base phonological representation on an explicit 
and direct description of articulatory movement in space and over time. As 
argued by Fowler (1980), incorporation of time (in particular) into the basic 
definition of phonetic units can simplify much of the complex translation that 
is required in an approach like that of implementation rules. Moreover, 
describing speech in terms of overlapping (relatively invariant) articulatory 
units with inherent time courses can account in a simple way for observed 
patterns of acoustic variation (Bell-Berti & Harris, 1981; Fowler, 1980; 
Fujimura, 1981; Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; 
Liberman & Mattingly, 1985; Mattingly, 1981). Such articulatory descriptions 
are particularly promising for nonlinear phonological analyses that require 
overlapping features, because there is a clear physical reality underlying the 
decomposition of articulation into quasi-independent systems whose movements 
are not always synchronized. 

While there is, of course, a long history of referring to aspects of 
articulation in phonological and phonetic representations (e.g., Abercrombie, 
1967; Chomsky & Halle, 1968; Jespersen, 1914; Ladefoged, 1971; Pike, 1943), 
such representations have often been forced to rely on impressionistic 
descriptions, and have emphasized the static aspects of articulation. Two 
recent developments in speech research make it feasible, we believe, to 
incorporate explicit characterizations of articulatory movement into 
phonological representation. The first of these is the development of 
improved technologies for tracking continuous articulatory movement (e.g., 
Fujimura, Kiritani, & Ishida, 1973), which reduces the need to rely on 
impressionistic observations by providing more explicit physical measurements. 
The second, the development of a theoretical framework for the analytical and 
mathematical description of coordinated movements (e.g., Bernstein, 1967; 
Fowler et al., 1980; Kelso & Tuller, 1984a, 1984b; Kugler, Kelso, & Turvey, 
1980; Saltzman & Kelso, 1983; Turvey, 1977), simplifies the description of 
movement sufficiently to make it tractable for phonological purposes, and also 
provides a framework in which to explore generalizations regarding coordinated 
articulatory movements. 

As we will attempt to show in this paper, an explicit description of 
articulatory movement can serve as the basis for phonological representation. 
The basic units in this framework are articulatory gestures (section 1.1). We 
will first define gestures simply as characteristic patterns of movement of 
vocal-tract articulators, or articulatory systems, and will suggest 
phonological analyses of two linguistic problems in terms of these gestures 
and the relations among them (section 2). Such analyses have many of the 
advantages found in recent nonlinear phonological theories, while at the same 
time providing a possible solution to the problem of the missing link between 
phonological and physical structures. We will then show how these 
characteristic patterns of movement can emerge from an abstract mathematical 
formalization of gestures and the lexical structures composed of such gestures 
(section 3). This mathematical formalization of a gesture, being developed in 
cooperation with our colleagues at Haskins Laboratories, uses a dynamical 
model to explicitly characterize the coordinated patterns of articulatory 
movement. In this approach, based on the concept of coordinative structures 
(e.g., Turvey, 1977) as instantiated in the task dynamic model of Saltzman & 
Kelso (1983), gestures are autonomous structures that can generate 
articulatory trajectories in space and time without any additional 
interpretation or implementation rules. 

1.1 Gestural Structure of Speech 

We begin our discussion of gestures with a simple example, the utterance 
[aba]. During this utterance, the lower lip moves up gradually toward the 
upper lip, reaches some peak upward displacement, and then moves downward 
again, as can be seen in Figure 1. The lip is constantly in motion, except 
instantaneously at its maximum displacement; it does not necessarily achieve 
any steady-state configuration that could be associated unambiguously with the 
/b/. The absence of steady states characterizes all observations of speech, 
whether articulatory or acoustic, and this may be one of the reasons why 
phonologists have posited a complex relation between phonology and speech. We 
argue, however, that it is the assumption of a steady state specification that 
leads to the apparent complexity, and that a phonology which inherently 
incorporates movement in its descriptions will simplify much of this apparent 
complexity. 

Figure 1. Trajectory of the lower lip in [aba], as measured by tracking 
infrared LED placed on subject's lower lip. 

Instead of looking for a steady-state correlate of the /b/ segment, then, 
we view the trajectory of the lower lip in [aba] as a pattern in space and 
time that characterizes utterances transcribed with a /b/ in them. That is, 
there is no particular spatial coordinate value of the lower lip (or any other 
articulator) that is held for some time and that is characteristic of /b/. 
Rather, it is the movement of the articulators through space over time that 
constitutes an organized, repeatable, linguistically relevant pattern. We can 
refer to this pattern as a bilabial closure gesture. [Not every utterance of 
a word transcribed with a /b/ will display exactly the trajectory of Figure 1; 
the trajectory will vary with vowel context, syllable position, stress, 
speaking rate, and speaker. We must, therefore, ultimately characterize a /b/ 
as a family of patterns of movement. In section 3, we will suggest how 
this family can be formally defined using an abstract set of equations that 
can generate the variant trajectories. For the present, we can think of a 
gesture as an instance of a family of related trajectories.] 

If every segment in a traditional phonological (or phonetic) 
representation could be described as one gesture, much as /b/ can be described 
as a bilabial closure gesture, then the implications of the gestural approach 
for phonology would be limited. However, the relation between segments and 
gestures is not always one-to-one. English voiced stops can, to a first 
approximation, be characterized as single gestures of bilabial, alveolar or 
velar closure; but other segments, such as the voiceless stops, require more 
than one gesture. In /p/, for example, we have a bilabial closure gesture 
much like that for /b/. In addition, however, the glottis must be opened for 
/p/, and then narrowed again. That is, from the point of view of 
spatio-temporal speech structure, /p/ is an organization of two gestures: a 
bilabial closure gesture plus a glottal opening (and closing) gesture. Thus, 
there is no one-to-one relation between gestures and segments. 

Nor do gestures bear a one-to-one relation to traditional phonological 
features. A single bilabial closure gesture would correspond to a number of 
features, such as [-continuant], [+anterior], [-coronal], [+consonantal], 
[-vocalic], etc. In general, differences in the presence or absence of 
glottal or velic (opening and closing) gestures correspond to single feature 
differences, while supraglottal constriction gestures correspond to multiple 
feature differences. Thus, gestures do not bear a one-to-one relation to 
either phonological segments or phonological features. Rather, they represent 
organized patterns of movement within oral, laryngeal, and nasal articulatory 
systems.* 

In addition to the gestures themselves, the relations among gestures also 
play an important role in the articulatory description, similar to the role of 
the associations among autosegments in autosegmental phonology. As an 
example, consider the bilabial closure and glottal opening gestures in words 
transcribed as beginning with /p/. These gestures are not temporally 
simultaneous, but repeated observations of words beginning with /p/ reveal 
tight spatial and temporal relations between the two gestures (Löfqvist, 1980; 
Löfqvist & Yoshioka, 1985; cf. Lisker & Abramson, 1964). The incorporation of 
such spatio-temporal coordination within our description can be seen as having 
two different functions. On the one hand, the representation of a tight 
relation between the two gestures defines a phonological class: the class 
traditionally described as words beginning with /p/. On the other hand, the 
relation is specified in explicit enough fashion to capture the systematic 
language-particular aspects of the timing between the gestures. 

In the particular example /p/, the gestural structure specified 
corresponds to a segment. In general, however, the interdependences among 
gestures are not restricted to those that constitute single segments in 
traditional approaches. Rather, the pattern of relations among a set of 
gestures, the gestural constellation, can serve the functions typically filled 
by other phonological structures, ranging from complex segments to syllables 
and their constituents. In section 2, we show how such constellations can 
provide the basis for phonological analyses in cases where featural overlap 
has been invoked, of the type explored by Anderson (1974) and developed 
further within autosegmental, including CV and X-tier, phonologies. 

2. Gestural Analysis of Two Linguistic Problems 

As noted in the Introduction, segmental and gestural analyses differ 
minimally for gesturally simple segments like /b,d,g/. Even in such cases, 
however, a gestural analysis has the additional advantage that it accounts for 
the physical movements of the articulators as well as for the phonological 
structure. For gesturally more complex structures, the gestural analysis 
differs from a segmental analysis. In this section, we present two instances 
in which the analyses diverge: English /s/-stop clusters (section 2.1) and 
prenasalized stops (section 2.2). Both are examples of "complex segments" 
(e.g., Ewen, 1982), that is, they behave in some ways like single segments and 
in some ways like clusters. Using the gestural approach, we attempt to answer 
the question of "one" vs. "two" units by analyzing the observed articulatory 
movements themselves. As we shall see, this gestural analysis can provide 
structures that allow linguistic facts to be stated more generally and simply 
than in a segmental analysis. We will also see how these same gestural 
structures can account for some of the observed patterns of timing. 

2.1 Glottal Gestures and /s/-Stop Clusters 

In a segmental phonology, the description of initial /s/-stop clusters in 
English is problematic. There are at least two facts about these clusters 
that require specific statements within the phonology, statements that apply 
only to these clusters. First, the phonotactics must state that there is no 
contrast between voiced and voiceless stops following initial /s/; that is, 
there is a "defective" distribution. Second, the realization of this stop as 
voiceless unaspirated must be specified by a separate phonetic (or 
phonological) rule. In current approaches, these facts do not follow from any 
more general characterizations of English phonology or phonetics.^ In 
addition, other problematic aspects of such clusters have led phonologists to 
argue that they are more "unitary," i.e., more nearly describable as single 
segments, than are other clusters. [Ewen (1982) summarizes the evidence for a 
monosegmental analysis, which includes the /s/-stop clusters' violation of the 
sonority hierarchy and the failure of these clusters to alliterate with /s/ in 
Germanic verse.] In this section, we will show not only how, in an 
articulatory phonology, the phonetic and distributional facts about the 
clusters follow from a more general constraint on the articulatory structure 
of English words, but also how the gestural structure might account for the 
clusters' ambiguous status as one or two units. 

Crucial to this account is an understanding of the behavior of the 
glottis in voiceless stops and clusters. This has recently been investigated 
in English (Yoshioka, Löfqvist, & Hirose, 1981), Swedish (Löfqvist & Yoshioka, 
1980a), Icelandic (Löfqvist & Yoshioka, 1980b; Petursson, 1977), and Danish 
(Fukui & Hirose, 1983). Like English, these other Germanic languages contrast 
an initial voiceless aspirated stop with either an unaspirated or voiced 
initial stop, but neutralize the contrast after /s/. In all cases, a single 
glottal opening/closing gesture is found for words beginning with /sC/ 
clusters (where C is a stop). This single gesture is similar to the one that 
occurs either with /s/ alone initially, or with one of the initial voiceless 
aspirated stops, although the magnitude of the gesture tends to be a bit 
smaller when accompanying the stops. As Petursson (1977) argues, the failure 
to find two glottal gestures in the initial /sC/ clusters cannot be due to a 
principle of economy of movement under which the glottis remains open for as 
many voiceless segments as are required. That is, the glottis is not, 
apparently, held in an open position during long periods of voicelessness 
(Yoshioka et al., 1981), but rather exhibits a sequence of opening and closing 
movements. This can be found, for example, in /s#C/ sequences (i.e., 
sequences containing a word boundary), which can show two glottal opening and 
closing gestures. Thus, the failure to find two glottal gestures for initial 
/sC/ clusters points to a generalization about the linguistic organization in 
these languages: words begin with, at most, a single glottal gesture. 

These observations form the basis for the gestural analysis of /sC/ 
clusters. Here, contrasts are described in terms of different characteristic 
gesture constellations, that is, one or more gestures in a specific 
spatio-temporal relation. For example, /pa/ and /ba/ differ in the presence 
vs. absence of the glottal opening/closing gesture; /pa/ and /sa/ differ in 
that one has a bilabial closure gesture, the other an alveolar fricative 
gesture, both in constellations with a glottal gesture (with characteristic 
spatio-temporal relations). The initial /sp/ cluster is a constellation of 
three gestures: an alveolar fricative gesture, a bilabial closure gesture and 
a single glottal opening/closing gesture. Thus, the glottal gesture can occur 
in a constellation with a single oral constriction gesture (as in /pa/ or 
/sa/), with two (as in /spa/), or alone (as in /ha/). [Cf. Hockett's (1955) 
proposal of an immediate constituent analysis for /sC/ clusters along similar 
lines.] The phonotactics of English in this approach is a statement of the 
possible constellations of gestures. The generalization of interest here is 
that, in word-initial position, English has at most one glottal gesture, of 
roughly constant magnitude, regardless of the other gestures with which it 
co-occurs. 
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This generalization accounts for the lack of word-initial contrast in 
English between /sp/ and /sb/. That is, a word-initial contrast would require 
either two glottal gestures in the constellation for /sp/, or a different, 
much smaller, glottal gesture for /sb/. Both of these possibilities are ruled 
out by the single-glottal-gesture generalization. Moreover, in conjunction 
with a fact about intergestural coordination in English, it also accounts for 
the realization of the /p/ in initial /sC/ clusters as voiceless unaspirated. 
This additional fact is that peak glottal opening typically occurs at the 
midpoint of a fricative gesture, if there is one present in the constellation, 
or, if not, at the release of a stop closure gesture (Yoshioka, Löfqvist, & 
Hirose, 1981). The generalization is presented in its current form in (1); it 
might ultimately be simplified by referring to the coordination of the glottal 
gesture with the vowel. 

(1) Glottal gesture coordination in English 

(a) If a fricative gesture is present, coordinate the peak 
glottal opening with the midpoint of the fricative. 

(b) Otherwise, coordinate peak glottal opening with the 
release of the stop gesture. 

Statement (1) holds for both single consonants and consonant clusters. 
For single consonants, it accounts for the fact that initial voiceless stops 
are aspirated (since, by (1b), the peak glottal opening occurs at the release 
of closure), while initial fricatives are not (1a). For clusters like /sp/, 
it accounts for the lack of aspiration. That is, since the peak glottal 
opening occurs during the fricative (1a), by the time the following stop 
closure is released the glottis is already narrowed, producing a voiceless 
unaspirated stop. [The above analysis of aspiration is similar to a proposal 
by Catford (1977), who did not, however, explicitly discuss gestural 
organization.] 
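
The coordination statement in (1) can be read as a small decision procedure. The following Python sketch is only an illustration of that reading, not an implementation taken from the literature. The class `OralGesture`, the labels "fricative", "stop", and "sonorant", and both function names are assumptions introduced here, and the sketch presupposes that a single glottal opening/closing gesture is present in the word-initial constellation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class OralGesture:
    """One oral constriction gesture in a word-initial constellation (hypothetical type)."""
    kind: str   # "fricative", "stop", or "sonorant" (labels assumed for this sketch)
    label: str  # informal name, e.g. "s", "p", "l"


def peak_glottal_anchor(onset: Sequence[OralGesture]) -> Optional[Tuple[OralGesture, str]]:
    """Return the gesture and landmark that peak glottal opening is timed to, per (1)."""
    for g in onset:
        if g.kind == "fricative":
            return g, "midpoint"   # (1a): a fricative gesture is present
    for g in onset:
        if g.kind == "stop":
            return g, "release"    # (1b): otherwise, the stop release
    return None                    # no obstruent gesture to coordinate with


def stop_is_aspirated(onset: Sequence[OralGesture]) -> bool:
    """A stop counts as aspirated when the glottis is widest at its release."""
    anchor = peak_glottal_anchor(onset)
    return anchor is not None and anchor[1] == "release"


s = OralGesture("fricative", "s")
p = OralGesture("stop", "p")
l = OralGesture("sonorant", "l")

print(stop_is_aspirated([p]))        # /pa/: peak timed to the stop release
print(stop_is_aspirated([s, p]))     # /spa/: peak timed to the fricative midpoint
print(peak_glottal_anchor([p, l]))   # /pl/: peak at stop release, overlapping the lateral
```

On this reading, aspiration is not stipulated separately for each onset type; it falls out of where the single glottal peak is anchored.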

Statement (1) can also account for another aspect of the phonetic 
structure of English — the devoicing of sonorants following initial voiceless 
stops but not following initial voiceless fricatives. For initial /pl/, for 
example, (1b) predicts that the peak of the glottal gesture is timed to occur 
at the release of the stop gesture, regardless of the presence or absence of 
other gestures. As Catford (1977) notes, the alveolar lateral 
(impressionistically, at least) has already been achieved by the time of 
release of the stop closure, so that the wide-open glottis co-occurs with the 
lateral, producing a voiceless lateral. Thus, the voicelessness of the 
lateral follows directly from the nature of the gestures and the independently 
required generalization about gestural coordination, and does not have to be 
stated as a separate allophonic rule. In contrast, (1a) predicts that 
sonorants will be only slightly devoiced following initial voiceless 
fricatives, and not devoiced at all in clusters such as /spl/, since the peak 
glottal opening occurs at the midpoint of a fricative gesture regardless of 
the number of following consonantal gestures. Thus, the intergestural 
coordination generalization captures a number of facts about English phonetics 
that would otherwise require separate statements. 

Returning to the single-glottal-gesture generalization, we note that it 
also has implications beyond its explanation of the defective distribution of 
/sC/ clusters. The ambiguous nature of such clusters is inherent in their 
proposed gestural constellations, consisting of a single glottal gesture with 
two overlapping oral gestures. These clusters might act as single units under 
the influence of the single glottal gesture, or as sequences of two units 
under the influence of the two oral gestures. A similar analysis has been 
proposed for ejective clusters such as [t'p'] in Kabardian (Anderson, 1978). 
He argues that Kuipers' (1976) analysis of Kabardian as phonologically 
vowelless can be maintained if the ejective clusters are treated in a unitary 
way as complex segments. Their unitary phonological behavior is related by 
Anderson to their articulatory nature, consisting of a sequence of two oral 
articulations associated with a single laryngeal gesture. Thus, he proposes 
an autosegmental-type analysis, in which a single laryngeal specification is 
associated with a sequence of specifications for oral articulators. The 
parallel with the gestural constellations proposed here for /sC/ clusters is 
obvious. Both cases independently support our contention that gestural 
structures derived from observing articulatory movements provide an 
appropriate basis for stating phonological generalizations. Moreover, taken 
together, they suggest a general principle that a particular type of gestural 
structure (one laryngeal gesture organized with two oral ones) may be 
associated with ambiguous phonological behavior. 
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2.2 Prenasalized Stops and Nasal-Stop Clusters 

Prenasalized stops constitute another class of complex segments whose 
analysis has been used to enrich the strictly segmental phonological model 
(e.g., Anderson, 1976; Ewen, 1982; Feinstein, 1979). In this section, we will 
show that the difference between prenasalized stops and nasal-stop sequences 
posited in such analyses cannot predict the kinds of temporal regularities 
shown by nasal-stop sequences in English, unless certain ad hoc rules are 
added. These temporal regularities lead us to hypothesize a similar gestural 
analysis for prenasalized stops and English clusters, and we will present 
articulatory data to support this analysis. This gestural analysis, we will 
argue, captures the relevant phonological generalizations while allowing the 
temporal regularities to be predicted directly from the gestural organization. 

Anderson (1976) has presented arguments for analyzing prenasalized stops 
as single segments, but with a sequence of values for the feature [nasal] 
(this is consistent with Herbert's 1975 acoustic data showing that 
prenasalized stops have roughly the same duration as simple stops). Thus, the 
domain of value-assignment for the nasal feature is not coterminous with the 
boundaries between segments. In this way, the ambiguous nature (unitary vs. 
sequential) of such stops can be directly captured. His representation for a 
prenasalized stop is shown in (2a) and for a sequence of homorganic 
nasal + stop in (2b). 

(2)          (a)  ᵐb          (b)  m   b

    cons          +                +   +
    nasal        + -               +   -
    ant           +                +   +
    cor           -                -   -



The structures represented in (2a) and (2b) might be expected to lead to 
different phonetic entities. From the gestural point of view, we would expect 
to find a difference between the bilabial closure gestures in (2a) and (2b), 
with the prenasalized stop (2a) having a single bilabial closure gesture, and 
the nasal-stop cluster (2b) having either two bilabial closure gestures, or, 
possibly, a single longer bilabial closure gesture. Since in English, words 
like <camper> and <canker> are analyzed as having nasal-stop sequences, we 
would expect them to have structures like that in (2b). [The analysis is 
supported by distributional considerations: /mp/, /mb/, etc. cannot occur in 
syllable-initial position where no sonorant-stop sequences are allowed, but 
they can occur post-vocalically, along with other sonorant-stop sequences.] 
However, certain durational properties of the words containing these clusters, 
specifically the duration of the preceding vowels, are not correctly predicted 
by representations such as (2b). 



The durational characteristics of the English words <camper> and <camber> 
were analyzed acoustically by Vatikiotis-Bateson (1981) and compared to words 
containing single segments, <capper>, <cabber>, and <cammer>. His results 
showed, as expected from other studies (e.g., Haggard, 1973; Lindblom & Rapp, 
1973; Walsh & Parker, 1982), shortening of the nasal and stop segments when 
they occurred in a cluster, compared to their durations as single consonants. 
However, contrary to expectations based on other studies (e.g., Fowler, 1983; 
Lindblom & Rapp, 1973), the stressed vowels preceding the nasal-stop clusters 
did not shorten when compared to the vowels before single consonants (i.e., 
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/mp/ vs. /p/ and /mb/ vs. /b/). That is, the labial nasal-stop sequences 
behaved like single consonants in terms of their effects on preceding vowel 
duration. Moreover, this similarity of behavior between the clusters and the 
single consonants extended to the effect on the preceding vowel duration of 
the consonantal voicing. In the Vatikiotis-Bateson data (as has also been 
shown by Lovins, 1978, and Raphael, Dorman, Freeman, & Tobin, 1975), vowels 
were shorter before the clusters containing a voiceless stop as well as before 
the single voiceless consonants, in spite of the fact that the nasal 
immediately following the vowel was voiced. That is, it was the voicing of 
the oral stop that determined the vowel length differences. 

In the above analyses of acoustic duration, the movement of the velum 
appeared to be irrelevant, since nasal-stop clusters behaved like single 
consonants. The similarity between the effects of clusters and singletons 
could be accounted for if there were a single bilabial gesture in English 
bilabials and nasal-stop sequences, regardless of the movement of the velum. 
Such a specification is best captured by representation (2a). This implies in 
turn, however, that nasal-stop sequences in English should have the same 
gestural structure as prenasalized stops. 

To test this hypothesis about the gestural structure of English 
nasal-stop sequences, we collected articulatory data for the same set of words 
(containing bilabial stops and nasal-stop sequences) that Vatikiotis-Bateson 
analyzed, and for similar sequences in a language with prenasalized stops, 
kiChaka (Chaga), a Bantu language spoken in Tanzania (Nurse, 1979). Chaga 
/mb/ is analyzed as a prenasalized stop and can occur word-initially in 
contrast to /m/ and /p/. In addition, word-initial /mp/ occurs in Chaga, but 
here the /m/ is analyzed as constituting a separate syllable. To investigate 
the labial sequences, we recorded a female speaker of American English and a 
male speaker of Chaga. The same overall format was used for both speakers. 
Each spoke the selected words containing bilabial sequences in a carrier 
phrase, repeating the words in the phrases five times. The selected words and 
carrier phrases are listed in (3). Notice that there is no /baka/, since all 
voiced obstruents in Chaga are prenasalized. 



(3)          English                    Chaga

    phrase:  It's a ___ tomorrow.       /wia mboka ___ kimbuho/
                                        'say to the starter slowly'

    words:   capper                     /paka/   'cat'
             cammer                     /maka/   'year'
             cabber
             camper                     /mpaka/  'boundary'
             camber                     /mbaka/  'curse'



The articulatory information for both speakers was gathered with a 
Selspot system at Haskins Laboratories. To track the movement of the lips and 
jaw, miniature infra-red light-emitting diodes (LEDs) were attached to the 
midpoint of the upper lip, the lower lip, and just under the chin (or, in the 
case of the bearded Chaga speaker, slightly above the chin). A modified video 
camera positioned to capture the profile of the speaker then tracked the 
movement of the diodes. In addition to the lip and jaw movement data gathered 
by the Selspot equipment, a gross measure of nasal airflow (during voiced 
speech) was obtained using an accelerometer attached to the bridge of the 
nose. 

Figure 2 shows the data for the English word <camper> in a carrier 
phrase. The acoustic signal is displayed in the bottom panel. The markings 
in each panel are derived from the acoustic signal — for example, NAS to 
indicate the onset of nasal murmur associated with the /m/, CLO for the onset 
of silence during the stop gap of the /p/, RL for the acoustic release of 
closure, AE and ER for voiced vocalic onsets. The top panel shows the 
information about nasality gathered by the accelerometer. Notice that, in 
addition to the nasal murmur (from 390 ms to 440 ms), the entire vowel /æ/ 
(from 290 to 390 ms) is nasalized. The middle panels display information 
about the vertical position of the upper and lower lips over time. Here we 
are most concerned with the labial closure gesture, extending approximately 
from 350 to 500 ms. The upper lip has very little vertical movement, while the 
lower lip smoothly rises and then lowers for closure and release. The actual 
acoustic closure encompasses most of the peak of the labial gesture, from 390 
ms (beginning of the nasal murmur) to 470 ms (release of the stop). [Note 
that closure is achieved before the highest point in the articulatory movement 
is achieved, and is not released (point labelled RL) until after the lip 
begins to move down again. The lips are, therefore, continuing to move even 
during acoustic closure, presumably compressing the tissues involved.] We will 
be concentrating on this lower lip movement as we investigate the articulation 
of the various labial sequences. 
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Figure 2. Acoustic waveform and articulatory measurements for single token of 
<camper>. 
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Our primary interest in the labial gestures is in the similarities and 
differences among gestures in different phonological categories, that is, 
single consonants, prenasalized consonants, nasal-stop clusters, and syllabic 
nasal plus stop. Both speakers proved to be quite regular across tokens 
within the same phonological category. Typical examples are shown in Figure 
3a for the English speaker (the medial /p/ of <capper>), and Figure 4a for the 
Chaga speaker (the initial /m/ of /maka/). In both figures, lower lip 
gestures for two repetitions of the word are superimposed and displayed above 
the acoustic signal from one of the tokens. The vertical lines indicate the 
onset and release of closure, as measured in the acoustic waveform. To 
compare tokens between phonological categories, a single repetition was chosen 
from each set of five for each of the words to be compared. In each case, 
this representative item (selected from the second, third, or fourth 
repetition) was identical, as determined by visual inspection, to at least one 
other repetition, both in terms of the pattern over time of the lower lip 
gesture, and in terms of the timing of the gesture relative to the surrounding 
vowels. These representative items are used in the rest of the figures; the 
conclusions based on these items have been confirmed by comparisons among all 
the repetitions. 

The between-category comparisons indicate that, contrary to expectations 
based on segmental descriptions such as (2b), all of the phonological 
categories except for the syllabic nasal+stop are represented by a single 
labial gesture. That is, there is no systematic difference among the labial 
gestures associated with a single consonant, a prenasalized consonant, and a 
consonant cluster. This can be seen in Figure 3b for English, and Figure 4b 
for Chaga. 



Figure 3b shows the lower lip traces for English <cabber>, <cammer>, 
<camper>, and <camber> superimposed on the trace for <capper>. [The gestures 
have been slightly offset, both horizontally and vertically, to facilitate 
comparison of their overall forms. The extent of the horizontal offsets can be 
determined from the lines on the left; the vertical offsets are represented by 
the tick marks on these lines.] While there are small differences in the 
amplitude and in the slope of the onset and offset of the gesture, which 
correspond to similar differences among /p/, /b/, and /m/ reported in the 
literature (e.g., Kent & Moll, 1969; Sussman, MacNeilage, & Hanson, 1973), the 
overall envelope of the gestures is similar, particularly in the central 
portion demarcated by the lines on the <capper> trace. That is, regardless of 
whether the consonantal portion is described as a single consonant (/b/, /p/, 
or /m/) or as a consonant cluster (/mp/ or /mb/), in English there appears to 
be a single labial gesture. 

Figure 4b shows the superimposed lower lip gestures for the Chaga words 
/paka/, /maka/, and /mbaka/. Again, as in English, there is a single gesture, 
quite similar in overall envelope, and particularly in the central portion. 
That is, in Chaga there is a single labial gesture associated with single and 
prenasalized consonants. 

The syllabic nasal+stop /mp/ in Chaga, however, presents a different 
picture. Comparing /maka/ and /mpaka/ in Figure 5a, we see for the first time 
a clear difference in the overall duration of the envelope of the lower lip 
gesture. The gesture for /mpaka/ is clearly longer, as can be confirmed by 
checking the /maka/-/paka/ comparison in Figure 5b. This difference in 
duration, we argue, is the result of two overlapping labial gestures. To see 
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Figure 3. Comparison of lower lip trajectories for English words, 
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that this might be the case, consider Figure 5c, in which the gestures for 
/maka/ and /paka/ are superimposed in an overlapping fashion, and Figure 5d, 
in which the gesture for /mpaka/ is superimposed on the overlapping gestures 
for /maka/ and /paka/. The close correspondence observable in the figure 
holds across all the repetitions for these utterances. That is, the gesture 
for syllabic /m/ plus /p/ corresponds closely to the individual gestures for 
/m/ and /p/ arranged sequentially with partial overlap. An alternative 
description could be suggested, namely that the bilabial closure gesture in 
/mp/ was simply "larger." Note, however, that both the amplitude and the 
slopes of the onset and offset are unchanged from the single consonant case. 
This argues for overlap, or else another mechanism that simply holds the peak 
of a gesture, rather than for a larger gesture, since, as reported by Kelso, 
Vatikiotis-Bateson, Saltzman, and Kay (1985), larger gestures (due to changes 
in stress and rate, at least) typically have both increased amplitude and 
steeper slopes. We provisionally prefer the analysis of overlap to that of a 
held peak, since it requires mechanisms that must in any case occur in the 
phonology, namely overlap among gestures involving different articulators. 
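
The reasoning about overlap can be made concrete with a toy calculation. The Python sketch below is only an illustration under strong assumptions: each lower-lip gesture is approximated by a raised-cosine bump, two overlapping gestures are combined by taking the higher of the two at each instant, and all numerical values are invented. None of this is taken from the measurements reported here. It shows that partial overlap of two identical gestures lengthens the envelope while leaving peak height and steepest slope unchanged.

```python
import math


def gesture(t, onset, duration, amplitude):
    """Raised-cosine bump standing in for one lower-lip closure gesture (assumed shape)."""
    if t < onset or t > onset + duration:
        return 0.0
    phase = (t - onset) / duration
    return amplitude * 0.5 * (1.0 - math.cos(2.0 * math.pi * phase))


def summarize(trace, dt, threshold):
    """Return (peak height, steepest slope, time spent at or above threshold)."""
    peak = max(trace)
    slope = max(abs(b - a) / dt for a, b in zip(trace, trace[1:]))
    time_above = sum(1 for y in trace if y >= threshold) * dt
    return peak, slope, time_above


DT = 1.0                                   # sample spacing in ms (arbitrary)
times = [i * DT for i in range(400)]

# One gesture alone, as for /m/ or /p/ (all numbers are made up for illustration).
single = [gesture(t, 50.0, 150.0, 10.0) for t in times]

# Two identical gestures, the second starting 60 ms later; the combined lip
# height is taken as the higher of the two at each instant (an assumption).
overlapped = [
    max(gesture(t, 50.0, 150.0, 10.0), gesture(t, 110.0, 150.0, 10.0))
    for t in times
]

single_stats = summarize(single, DT, threshold=5.0)
overlap_stats = summarize(overlapped, DT, threshold=5.0)
print("single:    ", single_stats)
print("overlapped:", overlap_stats)
```

A held-peak mechanism would give a similar envelope, so the sketch does not decide between the two analyses; it only shows that overlap is sufficient to produce a longer gesture without a change in amplitude or slope.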
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Figure 5. Comparison of lower lip trajectories for Chaga /mp/, /m/, and /p/. 

Thus, the articulatory evidence suggests that syllabic /mp/ is a gestural 
constellation including two partially overlapping labial gestures. This 
distinguishes it from Chaga prenasalized stops and English nasal-stop 
clusters, both of which are constellations involving a single bilabial closure 
gesture. That is, as we hypothesized, the English nasal-stop clusters and 
Chaga prenasalized stops would both, using Anderson's (1976) framework, be 
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represented as (2a). Representation (2b), however, would be more appropriate 
for the Chaga syllabic /mp/. 

How then, given the similarity between their gestural structures, do we 
capture the distinction between prenasalized stops in Chaga and nasal-stop 
sequences in English? The simplest statement is as a distributional, or 
phonotactic, difference. That is, in Chaga such gestural structures can occur 
in word (and/or syllable) initial position, whereas in English the same 
gestural structures cannot occur in initial position. Thus, we can account 
for the difference between prenasalized stops and nasal-stop clusters as 
different distributional characteristics of the same physical structure. 
However, such a distributional difference can only serve to distinguish 
prenasalized stops and nasal-stop clusters in two different languages. We 
still need to address the issue of how prenasalized stops differ from 
nasal-stop clusters in a language where they contrast. In such a case, we 
expect an articulatory difference between the two. 

Feinstein (1979) describes one such language, Sinhalese. We know that 
there is in fact a physical difference here between the nasal-stop clusters 
and prenasalized stops: Feinstein reports that the nasal is longer in /nd/ 
clusters than in prenasalized /ⁿd/. This difference might reside either in 
the relative timing of the oral and velic gestures, or in the oral closure 
gesture itself, which could be longer, or doubled, for the clusters. We would 
in fact expect the latter to be true, because the nasal-stop clusters are part 
of a morphological class of inanimate nouns in which gemination is an active 
process. That is, members of this class containing oral stops alternate 
between single and geminate stops in the plural and definite singular (e.g., 
[potu] and [potta] 'core', Feinstein, 1979), while members containing 
nasal-stop sequences alternate between prenasalized and nasal-stop clusters 
(e.g., [kaⁿdu] and [kanda] 'hill', Feinstein, 1979). Such a classification 
would be simply explained if the identifying characteristic of the class were 
the lengthening, or doubling, of the oral gesture. For the oral stops, this 
would result in a geminate, while for the prenasalized stops, it would result 
in a lengthened nasal, i.e., a nasal-stop cluster. 

Such a characterization, combined with a gestural reformulation of their 
syllable template, also directly captures Feinstein's (1979) and Cairns and 
Feinstein's (1982) analysis of the difference between the prenasalized and 
nasal-stop sequences in Sinhalese as a difference in syllable structure, with 
the prenasalized stops being tautosyllabic (syllable initial) and the 
nasal-stop sequences being heterosyllabic. In the gestural reformulation, the 
terminal nodes of the syllable template would be oral gestures, rather than 
segments, and syllable onsets would be restricted to single oral gestures. 
Since the velic gesture under this formulation is not relevant to the syllable 
structure, either prenasalized or single oral stops (assuming both have a 
single oral gesture) could occur in the syllable onset. That is, 
lengthened/doubled oral gestures would be heterosyllabic, while single oral 
gestures would be syllable initial, regardless of their cooccurrence with 
velic gestures. Such a reformulation correctly captures the syllabification 
difference in the nasal-stop sequences, and eliminates the need for a separate 
language-specific statement about the priority of prenasalization. 

As we have shown, then, the gestural structures for prenasalized stops 
and nasal-stop clusters lead to both a simple statement of their physical 
properties, and a satisfactory description of their phonological properties. 
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Anderson's (1976) analysis of prenasalized stops predicts their temporal 
properties fairly well, but for nasal-stop sequences in English, some rule 
would be required to collapse a structure like (2b) into a structure like 
(2a). Moreover, the structure in (2a) then must be mapped onto an 
articulatory representation, much like that which we take as our basic 
phonological representation. In contrast, in an articulatory phonology, the 
representation in terms of a spatio-temporal organization of gestures directly 
captures the articulatory structure as well as providing simple statements of 
phonological generalizations. 

3. Preliminary Formalisms 

The analyses of section 2 suggest that spatio-temporal descriptions of 
articulatory movements can, in fact, provide the basis for stating 
phonological regularities. In order to state such generalizations explicitly, 
the gestural structures must be formalized in some way, and it is to such 
formalizations that we turn in this section. We should note that these 
suggestions for formalization are preliminary and incomplete, and their 
implications for the description of a wide range of phonological data have not 
yet been investigated. Our preliminary attempts here are intended to simply 
show that the kind of structures that we have been arguing for can be 
rigorously formalized and that their adequacy in accounting for complex 
phonological data can, therefore, be tested. 

We begin by describing one promising approach to a dynamical 
specification of gestures and their coordination in section 3.1. This 
dynamical specification is meant to be detailed enough to account for 
phenomena such as coarticulation and language-particular timing patterns. In 
section 3.2, we show that a simplified, more qualitative notation can be used 
to index these dynamically-defined structures. These indices are a simplified 
way of representing gestural structures, appropriate to such linguistic 
functions as lexical contrast and description of phonological generalizations. 

3.1 Specification of Gestures and Inter-gestural Relations 

We have been using the notion of an articulatory gesture as a 
characteristic pattern of movement of an articulator (or of an articulatory 
subsystem) through space, over time. How can we precisely define such 
spatio-temporal patterns? We might attempt to specify the values of 
articulator position at successive points in time. Such an approach, however, 
in which time is explicitly one dimension of the description and spatial 
position another, has trouble with the complex variations in articulatory 
trajectories introduced by changes in speaking rate and prosodic factors. It 
would be preferable to view the change in position over time as the output of 
a more abstract system, such as a dynamical system, capable of generating a 
variety of related trajectories. 

Dynamical systems (e.g., Abraham & Shaw, 1982; Rosen, 1970) have been 
applied to problems of motor coordination in biological systems in general 
(e.g., Cooke, 1980; Fel'dman, 1966; Kelso & Holt, 1980; Kelso, Holt, Rubin, & 
Kugler, 1981; Kelso & Tuller, 1984a; Kugler et al., 1980; Polit & Bizzi, 1978) 
and to the organization of speech articulators in particular (Fowler, 1977; 
Fowler et al., 1980; Kelso & Tuller, 1984b; Kelso, Tuller, & Harris, 1983; 
Lindblom, 1967; Ostry & Munhall, 1985). Space and time are not specified in a 
point-by-point fashion in a dynamical system, but the system is capable of 
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specifying characteristic patterns of movement that are organized in space and 
time. There are two properties of such systems that make them useful to the 
description of linguistic gestures. First, for a given fixed specification of 
system parameters, the system can output an infinite number of different (but 
related) trajectories, as a function of the initial conditions of the 
articulators, and as a function of other dynamical systems (for other 
gestures) that might be simultaneously active. At least some trajectory 
context-dependence (i.e., coarticulation) can be handled in this way, by 
characterizing a gesture in terms of the invariant input parameters to such a 
system. Second, although the articulators are moving throughout such a 
gesture, the equation itself does not vary over time, but rather characterizes 
the whole pattern of movement. Thus, the traditional notion of a discrete 
element, imposed on speech from the outside, is replaced by the notion of 
coherent gestural movements that can be described by a single system of 
equations (cf. Fowler et al., 1980). 

To exemplify such equations, consider a physical example of a dynamical 
system, a mass attached to a spring. If the mass is pulled, stretching the 
spring beyond its rest length (equilibrium position), and then released, the 
system will begin to oscillate. Assuming that the system is without friction, 
the resulting movement trajectory of the mass can be described by the solution 
to equation (4). 

(4)  $m\ddot{x} + k(x - x_0) = 0$ 

where $m$ = mass of the object 

$k$ = stiffness of the spring 

$x_0$ = rest length of the spring (equilibrium position) 

$x$ = instantaneous position of the object 

$\ddot{x}$ = instantaneous acceleration of the object 

Thus, a time-varying trajectory is generated by an equation that does not 
itself change over time. Different trajectories can be obtained from this 
same system by different choices of values for the dynamical parameters $m$, $k$, 
and $x_0$, and by different initial conditions for $x$ and $\dot{x}$. Changing the 
stiffness $k$ of the spring will affect the frequency of oscillation of the 
mass, while changing the rest length (equilibrium position) of the spring $x_0$ 
and the initial position $x$ will affect the amplitude of oscillation. 
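
For reference, the frictionless mass-spring equation above has a standard closed-form solution that makes these parameter dependencies explicit. The symbols $\omega$ (angular frequency), $A$ (amplitude), and $T$ (period) are introduced here for convenience and do not appear in the original formulation.

```latex
\begin{align*}
m\ddot{x} + k\,(x - x_0) &= 0, \qquad \omega = \sqrt{k/m},\\
x(t) &= x_0 + \bigl(x(0) - x_0\bigr)\cos\omega t
        + \frac{\dot{x}(0)}{\omega}\,\sin\omega t,\\
A &= \sqrt{\bigl(x(0) - x_0\bigr)^2 + \bigl(\dot{x}(0)/\omega\bigr)^2},
\qquad T = \frac{2\pi}{\omega} = 2\pi\sqrt{m/k}.
\end{align*}
```

Substituting $\ddot{x} = -\omega^2 (x - x_0)$ back into the equation confirms the solution. The stiffness sets the frequency through $\omega$. For a release from rest, $\dot{x}(0) = 0$, the amplitude reduces to $|x(0) - x_0|$, the distance between the initial position and the equilibrium position.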

Recently, it has been shown that such dynamical systems can account for 
systematic trajectory differences associated with linguistic variations in 
stress (Browman & Goldstein, 1985; Kelso et al., 1985; Ostry & Munhall, 1985). 
In these papers, mass-spring models such as (4) provided an abstract 
description of the articulatory movements associated with lip closure. Thus, 
the x in equation (4) was taken to represent the vertical position of the 
lower lip, instead of the length of a spring. The lower lip in the stressed 
syllables took more time to move between the displacement peaks and valleys, 
which was modelled by decreasing the stiffness (k) in equation (4). The lower 
lip in stressed syllables also moved a greater distance, which can be modelled 
by increasing the difference between the rest length of the lower lip ($x_0$) and 
the initial position (although this aspect of the modelling was less 
thoroughly explored in the above papers). 
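
The two effects of stress described above can be checked numerically on the undamped mass-spring model. The Python sketch below is a minimal illustration, not a reconstruction of the cited studies: the parameter values and the function names `peak_to_valley_time` and `position` are assumptions, units are arbitrary, and the lip is assumed to be released from rest.

```python
import math


def peak_to_valley_time(m, k):
    """Half a period of m*x'' + k*(x - x0) = 0: time from one extreme to the next."""
    return math.pi * math.sqrt(m / k)


def position(t, m, k, x0, x_init):
    """Position at time t after release from rest at x_init (closed-form solution)."""
    omega = math.sqrt(k / m)
    return x0 + (x_init - x0) * math.cos(omega * t)


# Hypothetical parameter sets; units are arbitrary and chosen only for illustration.
unstressed = {"m": 1.0, "k": 400.0, "x0": 0.0, "x_init": -4.0}
# Stressed: lower stiffness and a larger distance between x_init and x0.
stressed = {"m": 1.0, "k": 225.0, "x0": 0.0, "x_init": -6.0}

for name, p in (("unstressed", unstressed), ("stressed", stressed)):
    t_half = peak_to_valley_time(p["m"], p["k"])
    excursion = abs(position(t_half, **p) - p["x_init"])
    print(f"{name}: peak-to-valley time = {t_half:.3f}, excursion = {excursion:.2f}")
```

Lowering the stiffness lengthens the time between extremes, and moving the initial position further from the equilibrium position enlarges the excursion, so the two observed properties of stressed syllables correspond to two separate parameters.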
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A gesture such as that for bilabial closure cannot be fully described by 
the movement of a single articulator; the coordination of a number of 
articulators is required, i.e., the jaw, the lower lip, and the upper lip. 
The task dynamics of Saltzman and Kelso (1983) offers a promising approach to 
modelling this coordination. Task dynamics provides an organization of 
articulatory movement that is defined in terms of a particular task to be 
performed (in our example, lip closure). It relates this task to the 
movement of the various articulators involved in its performance, in 
particular the jaw, the lower lip, and the upper lip. The positions of these 
articulators are anatomically linked: as the jaw moves, for example, the lower 
lip can move along with it. Because of this fact, lip closure can be achieved 
in a number of different ways, from moving only the jaw, to moving just the 
lower lip with relatively little jaw movement. It is this flexibility that 
allows a phonetic task such as bilabial closure to be achieved even when the 
movement of one of the component articulators is mechanically restrained 
during speech (Abbs, Gracco, & Cole, 1984; Kelso, Tuller, Vatikiotis-Bateson, 
& Fowler, 1984). Such compensatory behavior has been successfully modeled by 
the task dynamics approach (Saltzman, forthcoming). In addition, this 
flexibility can account for aspects of coarticulation such as the fact that a 
bilabial closure gesture is produced with a higher jaw position in [bi] than 
in [bæ] (Sussman et al., 1973). The default contributions of the component 
articulators to a given task, in the absence of perturbation or 
coarticulation, can be specified in terms of characteristic weightings for 
these articulators. These weightings may vary for different gestures: for 
example, the upper lip is weighted quite differently in bilabial closure and 
labiodental fricative gestures. 
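
One way to picture the role of articulator weightings is as a proportional allocation of the required change in lip aperture among the jaw, the lower lip, and the upper lip. The sketch below is a deliberate simplification and is not the task-dynamic model of Saltzman and Kelso (1983). The function name `distribute_closure`, the weights, and the 10 mm closure are all hypothetical.

```python
def distribute_closure(delta, weights, restrained=()):
    """Split a required change in lip aperture among articulators.

    delta      -- total aperture change the task demands (e.g. in mm)
    weights    -- hypothetical relative contribution of each articulator
    restrained -- articulators that are mechanically blocked and cannot move
    """
    free = {name: w for name, w in weights.items() if name not in restrained}
    total = sum(free.values())
    if total <= 0:
        raise ValueError("no free articulator can contribute to the task")
    shares = {name: 0.0 for name in weights}
    for name, w in free.items():
        # Each free articulator contributes in proportion to its weight.
        shares[name] = delta * w / total
    return shares


# Hypothetical default weightings for a bilabial closure gesture.
bilabial_weights = {"jaw": 0.5, "lower_lip": 0.3, "upper_lip": 0.2}

normal = distribute_closure(10.0, bilabial_weights)
perturbed = distribute_closure(10.0, bilabial_weights, restrained=("jaw",))
print("unperturbed:", normal)
print("jaw restrained:", perturbed)
```

In both cases the contributions sum to the full closure, so the task is still achieved when the jaw is restrained; only the division of labor between the lips changes.
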

A gesture, then, is defined by specifying (i) a dynamic equation (or a 
set of them), (ii) a motion variable or variables, i.e., the variable(s) to 
substitute for x in equation (4) or other dynamic equation, (iii) values for
the coefficients of the equation (the dynamic parameters), and (iv) weightings
for individual articulators. An initial application of this definition of
gesture was presented in Browman, Goldstein, Kelso, Rubin, and Saltzman
(1984). They employed a single undamped second order system (such as (4))
defined for two motion variables, lip aperture (vertical distance between the
upper and lower lips) and lip protrusion. Different values for the dynamic
parameters (stiffness and equilibrium position) were employed on alternate
motion cycles, so as to generate the articulatory trajectories appropriate for
an alternating stress sequence /'mama 'mama/. The computed output
trajectories of the upper lip, lower lip, and jaw were then used to control a
vocal-tract simulation (Rubin, Baer, & Mermelstein, 1981) that synthesized 
speech. Thus, a very simple specification of a bilabial closure gesture in 
terms of dynamically defined variables for lip aperture and protrusion 
successfully captured the information necessary to produce convincing speech. 
At the same time, as we have seen, such gestural descriptions are useful as a 
basis for phonological description. 

In order to generate the complete inventory of speech sounds, gestures
must be combined into constellations. Again, as with the gestures themselves,
the relations among gestures can be specified abstractly using spatio-temporal 
phase relations (Kelso & Tuller, 1985). A specification in terms of phase is 
neither solely spatial nor solely temporal, because the exact point in time 
associated with a particular phase angle will change as the frequency 
(stiffness) of the gesture changes, and the exact point in space will change 
as the amplitude (rest length) of the gesture changes. Rather, phasing 
specifies relations among characteristic spatio-temporal patterns. Empirical
evidence in favor of a spatio-temporal approach has been presented by Tuller,
Kelso, and Harris (1982) and Tuller and Kelso (1984). For example, Tuller et
al. (1982) have shown that in sequences like ['papip] and [pa'pip], the time
of onset of lip activity for the intervocalic consonant relative to jaw or
tongue activity for the initial vowel is quite variable across the two
different stress patterns and across changes in speaking rate. This indicates
that the purely temporal approach cannot specify gestural relations in a 
sufficiently general way. However, Tuller et al. go on to show that the onset 
of lip activity for the intervocalic consonant can be quite precisely (and 
linearly) related to the period of the vocalic cycle, defined as the time 
between the onset of activity for the first vowel and the onset of activity 
for the second. This linear relationship remains invariant across changes in 
speaking rate and stress. [Similar constancies in relative timing of acoustic 
events, across changes in rate, have been reported by Weismer and Fennell 
(1985).] Kelso and Tuller (1985) have further analyzed their movement data in
terms of phase and have shown that the consonant gesture begins at a fixed 
phase angle in the vowel cycle. As the vowel period changes with stress and 
rate, the absolute time corresponding to that phase angle will also change, in 
a systematic way. 
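
The consequence of a fixed phase angle can be made concrete with a small calculation. The sketch below assumes, for illustration only, that the consonant gesture begins at a constant phase angle of the vowel cycle. The function names, the 180-degree phase value, and the vowel periods are hypothetical, not measurements.

```python
def phase_to_latency(phase_deg, vowel_period_ms):
    """Absolute time into the vowel cycle at which a phase angle is reached."""
    return phase_deg / 360.0 * vowel_period_ms


def latency_to_phase(latency_ms, vowel_period_ms):
    """Phase angle (degrees) corresponding to an observed onset latency."""
    return 360.0 * latency_ms / vowel_period_ms


# Hypothetical vowel periods standing in for different rates and stress patterns.
for period in (200.0, 300.0, 400.0):
    latency = phase_to_latency(180.0, period)
    print(f"period {period:.0f} ms -> consonant onset at {latency:.0f} ms "
          f"(phase {latency_to_phase(latency, period):.0f} degrees)")
```

The absolute latency changes with every period, which is why a purely temporal specification fails, while the latency stays a constant proportion of the vowel period, which is the linear relation described above.
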

We take, then, as a first hypothesis that gestures can be characterized 
in terms of a dynamical system and its associated motion variables and 
parameter values, and that intergestural relations can be specified in terms
of their phasing. This framework can accommodate the cross-linguistic timing
differences discussed in section 1.0 quite naturally, although the analysis in 
any particular case (e.g., phase differences vs. stiffness differences) 
remains to be determined by the relevant spatial and temporal data. 

3.2 Contrastive Articulatory Structures

So far, gestural organizations have been described in terms of attributes
of the motions of physical articulators, including the more abstract and 
general physical descriptions provided by dynamics, and specifically task 
dynamics. That is, we have shown in the last section how it is possible to 
capture spatio-temporal articulatory structure, using a dynamical framework. 
In this section, we continue to develop a formalism for articulatory 
phonological representation by laying out some sample lexical descriptions. 

One function of a lexical description is to provide information about the 
physical structure of an item so that linguistically significant similarities
and differences among lexical entries can be observed in as simple and direct 
a way as possible. Since we are dealing solely with articulatory structure, 
the domain in which linguistic facts such as distinctiveness can be stated 
consists of the set of articulatory gestures and their relations. The 
descriptions in this section differ from those in the previous section only in 
terms of the degree of detail in the description. It is as if, in the present 
section, we have decreased the resolution on our microscope so that the 
descriptions are coarser-grained and more qualitative. Thus, we are referring 
to the same set of dynamically specified gestures, but this time using symbols
which serve as indices to entire dynamical systems. These symbolic descriptions
highlight those aspects of the gestural structures that are relevant for 
contrast among lexical items. In our discussion of lexical representations, 
then, there are three important considerations to keep in mind. First, the 
minimal units in the lexical representation are dynamically-defined
articulatory gestures. Second, these gestures are spatio-temporal in nature.
Third, the gestures are organized asynchronously, with varying degrees of 
overlap among the gestures. 

Table 1 shows the symbolic notation we are suggesting to index the 
gestures relevant to the English and Chaga words discussed in section 2.2. 
The symbols are shorthand for specific sets of dynamical equations and their 
associated motion variables and parameter values that can generate the kinds 
of articulatory trajectories seen in Figure 6. These gestural trajectories 
are based on the articulatory data for <camper>, an example of which was shown 
in Figure 2. The bottom panel is the acoustic signal. The middle 
articulatory panel is the actual recorded trace of the vertical movement of 
the lower lip. This closing and opening gesture of the lips is indexed by the 
β in Table 1. The other two panels of articulators are estimates only, and
show the amount of opening associated with the gesture, rather than actual
vertical height. The bottom articulatory panel displays a representative
glottal gesture associated with voicelessness: the peak indicates the maximum
opening of the vocal folds. [The shape of the glottal gesture was estimated
from Sawashima & Hirose (1983), while the timing was based on the acoustic
signal, combined with information from various studies on glottal timing
(Löfqvist, 1980; Löfqvist & Yoshioka, 1985).] Here, a γ in Table 1 represents
the glottal opening and closing gesture. Note that the presence of the 
glottal gesture means an open glottis, i.e., voicelessness. 

The top panel estimates the opening of the velo-pharyngeal port, based on 
the accelerometer record and published data on velum movement (e.g., Kent,
Carney, & Severeid, 1974; Vaissiere, 1981). These data indicate that in an
utterance like <camper>, velum lowering (i.e., velic opening) begins at the
release of the initial consonant, and velum raising (i.e., velic closing)
begins at some time before the achievement of articulatory closure for the
/mp/. The velum movement is represented by two separate gestures in Table 1,
an opening gesture indicated by a +μ (nasal), and a closing gesture indicated
by a -μ (non-nasal, i.e., oral). The decision to treat velic opening and
closing as two separate gestures, as compared with the glottal and oral
gestures that incorporate both opening and closing, is based on the fact that
each velic gesture may act as a word-level phenomenon, so that the velum can
possibly be held in either a closed or an open position indefinitely. Kent et
al. (1974) provide an example of the latter phenomenon: a long sentence with
many nasal consonants in which the velum remains lowered throughout. This can 
also be seen in nasal harmony as well as in non-distinctive nasalization. In 
addition, the velum opening and closing gestures may require different amounts 
of time; for example, Benguerel, Hirose, Sawashima, and Ushijima (1977) show 
that for a French talker, opening gestures are slower than closing gestures. 
It is possible that a comparable internal gestural structure will also be 
needed for oral and glottal gestures. 

Figure 6 also illustrates the timing relations among gestures. Note that 
the overlap among the gestural trajectories reflects the inherent 
spatio-temporal properties of the gestures as well as their asynchronous 
organization. That is, the gestures that form the /mp/ constellation all 
require a certain amount of time to unfold, but they are not necessarily 
synchronized in the sense of their onsets (or peaks, or offsets) being 
aligned. The velic opening, for example, begins at least 100 ms before the 
labial gesture begins, and velic closure begins sometime in the middle of the 
labial gesture. This asynchrony results in the vowel being nasalized, and the 
labial gesture being partly nasal and partly oral. The glottal gesture is
delayed with respect to the labial gesture, so that the glottis most likely
reaches its peak opening after the peak closure of the labial gesture, but
before its offset. Thus, a single glottal gesture, requiring a certain amount
of time to unfold and asynchronously aligned with the labial closure gesture,
generates both voicelessness and aspiration (as originally proposed by Lisker
and Abramson, 1964). In addition to the particular gestures, then, our
symbolic representation must capture aspects of the phase relations among the
gestures, because as we shall see, contrasting items may include the same set
of gestures, but in different relations.

Table 1
Gestural Symbols

symbol    gesture

β         bilabial closing and opening
γ         glottal opening and closing (returns to voicing position)
+μ        velic opening
-μ        velic closing
V         vowel

Figure 6. Acoustic waveform and bilabial, velic, and glottal gestures for
<camper>. [Panels, top to bottom: velic opening (estimated), lower lip height
(actual), glottal opening (estimated), acoustic waveform. Horizontal axis:
time in milliseconds.]

For the sake of our symbolic representations, we project gestural 
constellations into a two-dimensional representation which captures 
qualitative aspects of these relations that are important for contrast (or for 
certain kinds of phonological generalizations). Examples of such 
constellation projections for the English words investigated in section 2.2
are shown in Figure 7, with gestures for the initial consonant omitted. Chaga 
words beginning with /m/, /p/, and /mb/ have representations like those of the 
comparable English words, except that the initial V is not present for these 
words. The representation of Chaga /mpaka/ (with syllabic /m/) is also shown 
in the figure. Note that vocalic gestures are indexed simply as V, because 
their gestural aspects have not been investigated here. 

The vertical dimension of these representations is organized by 
articulatory subsystem. Gestures of the oral subsystem are found on the top 
two lines, gestures of the laryngeal subsystem are found on the third line, 
and velic gestures are at the bottom. The particular ordering (from top to 
bottom) is meant to relate the gestural structures to the more global
(syllable and foot) rhythmic organization of speech. (This rhythmic 
organization, corresponding to, e.g., metrical trees or grids, or CV skeleta, 
is itself not yet incorporated in these structures). The closer to the top a 
gesture is, the more relevant it is presumed to be in carrying the overall 
rhythm. Thus, vowel gestures are found on the top line, as they seem to be
most important in carrying the speech rhythm, with other gestures being
coproduced with them (cf. Fowler, 1983). Velic gestures, by contrast, are
placed at the very bottom, because they contribute very little, by themselves, 
to the rhythmic structure. 

The horizontal dimension of the representations in Figure 7 consists of a 
grid that can be used to give a qualitative indication of the relative phase 
relations of the gestures. The lines of the grid represent roughly 90 degree 
(quarter cycle) phase intervals. Two gestures that are lined up on the same 
grid line are assumed to be relatively synchronous. For example, their 
onsets, or their maximum displacements, might coincide in time. The grid 
lines are not, however, meant to indicate any special structural relation 
between lined-up gestures. For example, it is not necessarily the case that 
one of the gestures governs the other, or that there is a particularly 
cohesive (or tightly invariant) relationship between the gestures. Gestures 
on successive grid lines are assumed to have approximately a 90 degree phase 
relation (e.g., the displacement maximum of one gesture synchronized with the 
velocity maximum of another); those two lines apart are assumed to have a 180 
degree relation; etc. 

As an example, consider the constellation for English <camper>. The 
placement of the two V symbols indicates that they are phased 360 degrees
apart (they are separated by four grid intervals), and thus, they represent
one complete vowel cycle. For expository purposes, we can think of this vowel
cycle in terms of the action of the jaw--high for the consonant, low for the
vowel. The first grid line can be thought of as corresponding to the onset of
the first vowel gesture--the beginning of the descent of the jaw from its
maximum height towards its low position for the vowel. The third grid line
can then be used for gestures that are 180 degrees out of phase with respect
to the vowel gesture, i.e., a gesture whose onset occurs 180 degrees into the
vowel cycle, when the minimum jaw height is reached for the vowel. This is
(roughly) the phase relation between the bilabial closure gesture (on the
third grid line) and the vowel gestures reported by Kelso and Tuller (1985).
Note that this representation also directly captures the notion (Öhman, 1966;
Fowler, 1983) that consonant gestures overlap a continuous vowel production
cycle.

Figure 7. Gestural constellations for English words, and for Chaga /mpaka/
(see text for interpretation). [Panels: <cammer>, <cabber>, <capper>,
<camber>, <camper>, and Chaga /mpaka/.]

The glottal gesture in <camper> is positioned 90 degrees out of phase 
with the bilabial gesture. This would be the case, for example, if the peak 
glottal opening were synchronized with the point of peak velocity during the 
opening portion of the bilabial gesture, as is consistent with data on the 
timing of glottal opening for aspirated stops (e.g., Löfqvist, 1980), and as
can be seen in Figure 6. An unaspirated stop, as contrasted with an aspirated
one, would be represented by synchronizing the bilabial and glottal gestures.
Turning to the velic gestures, we note that the velic closing gesture (-μ) is
lined up with the glottal gesture. This could correspond to the maximum
displacement of the two gestures being synchronized--peak glottal opening with
the maximum velic closure. Note also that the velic opening gesture (+μ) is
positioned on the first grid line, directly capturing the fact that the velum
begins to open sometime close to the onset of the vowel (as indicated by the
nasal accelerometer, and also as seen in Kent et al., 1974).
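
The grid placements just described for <camper> can be written down as a small data structure. The sketch below is one hypothetical encoding, not the authors' notation. The class name, tier labels, and gesture names are invented for illustration. Grid lines are numbered from 1 and taken to be 90 degrees apart, as in the text, and placing the glottal gesture one line after the bilabial gesture (rather than before) is an assumption based on the description of peak glottal opening during the opening portion of the bilabial gesture. Gestures for the initial consonant are omitted, as in Figure 7.

```python
from dataclasses import dataclass

GRID_STEP_DEG = 90  # successive grid lines are roughly a quarter cycle apart


@dataclass(frozen=True)
class Gesture:
    name: str       # hypothetical label for the gesture
    tier: str       # articulatory subsystem (vertical dimension)
    grid_line: int  # 1-based grid position (horizontal dimension)


def relative_phase(a, b):
    """Qualitative phase of gesture b relative to gesture a, in degrees."""
    return (b.grid_line - a.grid_line) * GRID_STEP_DEG


# Assumed encoding of the <camper> constellation described in the text.
first_vowel = Gesture("vowel", "oral", 1)
bilabial = Gesture("bilabial", "oral", 3)
second_vowel = Gesture("vowel", "oral", 5)
glottal = Gesture("glottal", "laryngeal", 4)
velic_opening = Gesture("velic_opening", "velic", 1)
velic_closing = Gesture("velic_closing", "velic", 4)

print("vowel to vowel:", relative_phase(first_vowel, second_vowel))
print("vowel to bilabial:", relative_phase(first_vowel, bilabial))
print("bilabial to glottal:", relative_phase(bilabial, glottal))
print("glottal to velic closing:", relative_phase(glottal, velic_closing))
```

Under this encoding, an unaspirated stop would differ only in placing the glottal gesture on the same grid line as the bilabial gesture.
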

Such representations not only provide qualitative information about the 
articulatory structure for individual items, but also serve to differentiate 
items from each other. Compare, for example, the representations for English 
<camper> and Chaga /mpaka/. Here, based on our articulatory measurements, the
distinction between the English /mp/ and the Chaga syllabic nasal+stop is
represented by a second labial gesture for the Chaga. This gesture is
positioned on the top line (where normally only vowel gestures occur), in
order to capture the fact that this bilabial closure gesture assumes the
syllabic function within rhythmic structure that is more usually filled by
vowel gestures.

Another pair of lexical items, <cammer> and <camber>, demonstrates the 
distinctive use of phase structure in the representations. The only 
difference between the representations for <cammer> and <camber> lies in the 
phasing of the velic closing gesture. In <camber>, the velum closes during 
the labial gesture, so that there is a brief period of non-nasalized closure. 
In <cammer>, on the other hand, the velum closes sometime after the labial 
gesture is released. This is captured by different grid positions of the 
velic closure gesture. (The phasing of the gestures is inferred from evidence 
provided by the nasal accelerometer combined with the acoustic signal.) 

It is important to note that the physical descriptions provided by these 
lexical entries are not complete descriptions of the articulatory actions. 
Other physical or physiological events regularly occur in these utterances but 
are not part of these descriptions. For example, larynx lowering and overall 
expansion of the oral cavity (Westbury, 1983) typically accompany the bilabial 
closure for voiced stops. However, at this early stage of our investigation, 
we wish to focus on characterizing those aspects of the physical structure 
that are most relevant to capturing linguistic generalizations and specifying
contrast among lexical items. In addition, these qualitative representations 
may omit certain detailed differences that are present in the quantitative 
specification of the gestures' dynamic parameter values and phasing. For 
example, the phase angle for /b/ relative to the vowel is somewhat greater 
than it is for /p/ (approximately 205 degrees vs. 180 degrees in Kelso & 
Tuller, 1985). This difference corresponds to the durational differences 
(discussed in section 1) between vowels before voiced and voiceless stops. 
Such differences in gestural parameters and phasing are directly represented 
in the more quantitative description. 

The gestural constellations in Figure 7 represent contrast in a 
physically realistic way; however, the representations are clearly
understructured when compared to recent forms of phonological representation.
We expect, however, that additional structuring will emerge as we learn more 
about patterns of intergestural phasing, e.g., as we discover whether 
relations are best captured by phasing individual gestures to one another, or 
whether there are coherent subconstellations of gestures that should be phased 
with respect to one another. As such knowledge becomes available, it will be 
possible to look for convergences between such gestural structures and 
structures hypothesized on the basis of strictly phonological evidence. For
example, different syllable structures may correspond to different 
characteristic patterns of phasing. Again, we want to account for as much of 
phonological structure as possible in terms of the organizations required to 
describe articulatory structure. 

The phase relations among gestures are reminiscent of association lines 
among autosegments on different tiers in autosegmental and CV phonology. 
Gestural relations and autosegmental associations share the same advantage of 
allowing gestural overlap (in gestural terms) or multiple associations among 
autosegments (in autosegmental terms). From this perspective, an articulatory 
phonology and autosegmental phonology can be seen as converging on the same 
type of lexical representation. There is nothing in a gestural framework that 
contradicts autosegmental representations. Rather, autosegmental phonology 
and the present framework differ first in their starting points (phonological 
patterns vs. articulatory measurements), and second in the aspects of the 
representation that are more highly structured. In particular, the gestures 
have an explicit internal structure: they are dynamical systems that serve to 
structure the movements of articulatory subsystems. 

Thus, the gestural framework can provide a basis for making some 
principled predictions about the likelihood of a particular phonological 
feature (or set of features) being split off as a separate tier in 
autosegmental phonology. We would expect, in general, that the more 
articulatorily independent of other features a particular feature is, the more 
likely it is to become a separate autosegmental tier. In particular, we would 
predict that phonological features associated with gestures of the three 
articulatory subsystems would be the most likely to be segregated onto 
separate tiers, a view that is compatible with the prevalence of autosegmental 
analyses for nasalization and tone. Within the oral subsystem, the gestures 
involving the lips are relatively independent of the tongue gestures (although 
both sets share the jaw as an articulator). Thus, features involving lip 
gestures would be the next most likely to be segregated onto a tier. More 
generally, we expect that successive gestures specified with common motion 
variables but with different dynamic parameter values would be more likely to 
be grouped together on a separate tier than gestures with similar parameter 
values but specified for different motion variables. Thus, we expect that
place of articulation features (which correspond to motion variables) should 
readily split off onto separate tiers but that manner or constriction degree 
features like [continuant] or [high] (which correspond to particular values of 
the dynamic parameters) should do so more rarely. Formally speaking, there is 
no distinction in traditional feature systems between the features for place 
and manner, whereas in the gestural analysis, they correspond to distinct 
aspects of the representation. In general then, the explicit model of 
articulatory organization provided by the gestural model can lead to specific 
hypotheses about the hierarchy of expected tier independence. 

4. Concluding Remarks

We have outlined an approach to phonological representation based on 
constellations of articulatory gestures, and explored some consequences of 
this approach for lexical organization and statements of phonological 
generalizations. In particular, we showed how working within a gestural 
framework led to simple generalizations about initial /s/-stop clusters as 
well as nasal-stop sequences in English and Chaga. We additionally discussed 
the benefits of formalizing these gestures and their relationships in terms of 
dynamical systems. In general, we suggested that constellations of 
dynamically-defined articulatory gestures can capture articulatory facts in a 
simple and elegant fashion and show promise of providing more highly 
constrained and explanatory phonological descriptions. We intend to pursue 
this line of investigation further — to develop phonological rules that refer 
to gestural structures, and to discover the range of phonological phenomena 
that can be accounted for using gestural structures and rules.

Even within other phonological approaches, a gestural description of
speech could be used as the basic data for which the phonology attempts to 
account. That is, gestural structures could replace phonetic transcriptions 
as the "output" of the phonology. There are several reasons for doing so. 
First, it is easier to verify empirically the gestural structure of an 
utterance: the relation between gestural structures and physical observables 
is simple and constrained, compared to the relation between a segmental 
transcription and speech. Second, the gestural structure incorporates 
temporal information that is usually absent from segmental transcriptions. 
This is not only important for its own sake (in accounting for cross-language
differences, for example), but the increased resolution of the representation 
may sharpen the process of comparing competing phonological analyses. 
Finally, as we have argued above, certain aspects of phonological 
representations, such as tier decomposition, can be rationalized or explained 
with respect to such gestural structures. 

Footnotes

*The notion of a gesture has been used in a somewhat similar way as a
basic phonological unit in recent versions of dependency phonology (Ewen,
1982; Lass, 1984). However, our use of the term gesture is restricted to
patterns of articulatory movement, while in dependency phonology it can refer 
to units that cannot, in any obvious sense, be defined in that way, such as 
the "categorial" gestures for major classes. 

*It might be possible to link the two statements by means of markedness 
conventions. Keating (1984), reanalyzing Trubetzkoy's markedness theory, has
argued that voiceless unaspirated stops tend to appear in various languages in 
positions of neutralization. 


REPRESENTATION OF VOICING CONTRASTS USING ARTICULATORY GESTURES*
Louis Goldstein† and Catherine P. Browman



The representation of cross-language voicing contrasts has been a 
recurrent problem, since the mapping between phonological categories and their 
physical phonetic realizations is not one-to-one. Recently, Keating (1984)
has argued that the representation of such contrasts for stop consonants must
involve purely abstract features ([+voice] and [-voice]), which map onto
phonetic categories for stops based on voice onset time (voiced, voiceless 
unaspirated, voiceless aspirated) in different ways for different languages. 
However, an articulatory analysis of voicing contrasts based on the presence 
or absence of glottal opening-and-closing gestures, as suggested in Browman 
and Goldstein (1986), may well provide a more nearly one-to-one mapping 
between phonological and physical categories. Moreover, as we shall show, 
such an articulatory analysis correctly predicts patterns of F0 behavior that
are wrongly predicted on the basis of purely abstract voicing categories. 

Keating (1984) argues that if phonological features are constrained to be
the same features as those used for phonetic representation, then certain 
cross-linguistic generalizations, involving voicing assimilation and 
correlations of voicing with vowel duration and pitch, will be missed. She 
demonstrates that the phonetic classes of voiced, voiceless unaspirated, and 
voiceless aspirated do not provide adequate cross-language natural classes for
phonological rules. For example, both French and English have a voicing
contrast, but different phonetic categories are involved. Whereas French (and
sometimes English, in utterance-medial position) contrasts fully voiced
[b,d,g] with voiceless unaspirated [p,t,k], English can contrast voiceless
unaspirated [p,t,k] with voiceless aspirated [pʰ,tʰ,kʰ] (in absolute initial
position). Thus, the phonetic categories that contrast are not the same in
the two languages. Nevertheless, as with many other languages, the vowels in
both French and English are longer before the phonologically [+voice] stops
than before the phonologically [-voice] stops (cf. Mack, 1982). Similarly,
cluster voicing assimilation is found in languages regardless of whether the 
stops contrast in voicing or aspiration. 

We suggest that Keating's abandonment of physically-based phonological 
features may in fact be unnecessary, and may simply reflect the wrong choice
of physical descriptors. That is, a description in terms of articulatory
gestures and their relative timing is, in fact, capable of accounting for the
patterns Keating discusses. The basic phonological units in such an approach
are articulatory gestures: organized patterns of movement within the oral,
laryngeal, and nasal articulatory systems. Thus, as formalized in Browman and
Goldstein (1986), voiceless stops can, to a first approximation, be
represented as constellations of two gestures (an oral constriction gesture
tightly coordinated with a glottal opening-and-closing gesture), while voiced
stops can be represented as single oral constriction gestures. As originally
suggested by Lisker and Abramson (1964), differences between aspirated and
unaspirated voiceless stops can be captured directly by the timing between the
two gestures in the constellation (or phasing: cf. Kelso & Tuller, 1985;
Browman & Goldstein, 1986).

*Journal of Phonetics, in press.

Voicing contrasts. In a gestural approach, the voicing contrast in both
French and English is described as the presence vs. the absence of a glottal 
opening-and-closing gesture. In utterance-medial position, both French and 
English [-voice] stops typically show glottal opening-and-closing gestures, 
regardless of whether they are unaspirated (French, English) or aspirated 
(English). The [+voice] stops, however, do not display glottal 
opening-and-closing gestures in either language. This correlation between 
contrastive voicing and the presence vs. absence of glottal gestures can be 
seen for French in the data of Benguerel, Hirose, Sawashima and Ushijima 
(1978). For English, Lisker, Abramson, Cooper, and Schvey (1969) found that
in running speech 96 percent of stressed /ptk/, 84 percent of unstressed
/ptk/, and only 6 percent of /bdg/ were produced with glottal
opening-and-closing gestures. Although the timing and size of the glottal
gesture in English and French differ, the categorization of stops as [-voice]
or [+voice] in utterance-medial position correlates quite well in both
languages with the presence vs. absence of a glottal opening-and-closing 
gesture. 

In absolute initial position, the glottis is already open (for 
breathing), and the opening portion of the glottal opening-and-closing gesture 
is, therefore, not actually observed. Thus, the relevant difference between 
[+voice] and [-voice] stops in this position is in the relative timing of the 
adduction of the vocal folds. Both French /d/ (Benguerel et al., 1978) and 
English /b/ (Flege, 1982) show glottal adduction well before stop release in 
utterance-initial position. Note that this is true for English (for eight of
the ten speakers) regardless of whether there is voicing during the closure. 
That is, both phonetically voiced and voiceless unaspirated /b/ can show the 
same pattern of glottal adduction. Thus, a physical characterization using 
articulatory gestures appears to capture the voicing contrast in English and 
French for utterance-initial as well as utterance-medial position. 

Vowel length. The simplest, and strongest, claim in a gestural approach is 
that vowel length differences will be correlated with the absence (longer) and 
presence (shorter) of a glottal opening-and-closing gesture. For those 
languages on which both glottal and durational data are available, the strong 
claim appears to hold up. French and English, as discussed above, are clear 
examples. Dutch (Slis & Cohen, 1969), Swedish (Lindblom & Rapp, 1973), and 
Korean (Chen, 1970) all display the vowel length difference, and available 
data suggest that the contrast for these languages can be described as the 
presence vs. absence of glottal gestures: Dutch (Slis & Damste, 1967) and 
Swedish (Lindqvist, 1972) both contrast [-voice] stops that have glottal 
opening-and-closing gestures with [+voice] stops that do not. Korean presents 
a slightly more complicated case in utterance-initial position, but in the 
intervocalic environment relevant to the vowel length rule, the same pattern 
is found (Kagaya, 1974). 

Voicing assimilation. In a gestural approach, assimilation of a cluster in 
favor of the voiceless member can be described as a rule specifying the 
overlap of a single glottal gesture with two oral gestures. Assimilation in 
favor of the voiced member would be described as glottal gesture deletion. 
Thus, the gestural approach can account for voicing assimilation rules quite 
naturally. 

F0 patterns. Keating (1984) also discusses the relation between voicing 
contrasts and F0 patterns. In many languages, the F0 pattern on vowels 
following [+voice] stops is low and rising, while that following [-voice] 
stops is high and falling. Keating presents evidence from Hombert, Ohala, and 
Ewan (1979) that the F0 patterns on the vowels following stops in French and 
English are similar. The F0 following /ptk/ in either language shows a high 
falling pattern, while following /bdg/ it shows a low rising pattern. The F0 
difference is thus seen by Keating as reflecting the underlying abstract 
status, rather than the phonetic realization, since in the data of Hombert et 
al. (1979), English /bdg/ and French /ptk/ fall together phonetically as 
voiceless unaspirated. However, we can once again associate the similar 
behavior of French and English with the fact that for both languages, the 
[-voice] stops have a glottal opening-and-closing gesture while the [+voice] 
ones do not. 



The relation between voicing and F0 in Danish provides the most interesting 
comparison of the abstract and gestural analyses. In Danish, there is a 
contrast in initial position between aspirated and unaspirated stops. Unlike 
other contrasts described so far, however, both stops show glottal opening 
gestures (Frøkjær-Jensen, Ludvigsen, & Rischel, 1973). (The unaspirated 
stops have smaller glottal gestures and are timed differently.) Keating's 
abstract analysis predicts that Danish should behave like English and French 
in showing a high falling F0 pattern following [-voice] stops and a low rising 
pattern following [+voice] stops, since all three languages contrast [+voice] 
and [-voice] stops. A gestural analysis predicts that, on the contrary, the 
Danish stops, both of which have glottal gestures, should both show high 
falling F0 patterns. The gestural analysis, therefore, predicts that Danish 
will be unlike French and English, which contrast presence vs. absence of 
glottal gestures. Petersen's (1983) study of F0 following initial consonants 
in Danish supports the prediction based on the gestural analysis. The F0 
patterns following aspirated and unaspirated stops are the same: high and 
falling (with a pitch difference averaging only 2 Hz). Moreover, the Danish 
consonants examined that do not have glottal opening-and-closing gestures (/v/ 
and /m/) show a low rising F0 pattern. While these results must be treated 
with some caution, as other studies of Danish have revealed larger F0 
differences between aspirated and unaspirated stops (Jeel, 1975), nevertheless 
Petersen's study provides clear evidence for a correlation between glottal 
gestures and F0 patterns, rather than between abstract voicing categories and 
F0 patterns. 

Thus, analysis of cross-linguistic voicing contrasts in terms of glottal 
opening-and-closing gestures accounts for the similarities between languages 
as well as or, in the case of F0 patterns, better than the purely abstract 
analysis posited by Keating. In addition, the articulatory analysis captures 
the facts of articulation directly, rather than requiring an additional set of 
mapping functions. 

MAINSTREAMING MOVEMENT SCIENCE* 



J. A. S. Kelso† 



The target article by Berkenblit, Fel'dman, and Fukson (in press) is a 
fine synthesis of a program of research that attacks many of the important 
issues facing movement science. Our view is that some of these issues, local 
though they may seem to the study of movement, can be usefully viewed within a 
larger scientific context, particularly, recent developments in nonlinear 
dynamical systems and theories of cooperative phenomena in physical, chemical, 
and biological systems. Thus, the multidegree of freedom movements of animals 
and people, whose principles continue to elude us, may be couched within, or 
be extensions of, the laws underlying order and regularity in other natural 
systems. Thereby, a minimum set of principles may emerge for understanding 
utterly diverse phenomena (Maxwell, 1877). 

As others have long realized (e.g., Boylls, 1975; Greene, 1971, 1982; 
Turvey, 1977), the field of control and coordination of movement has a rich 
heritage, stemming from Bernstein's influential theories and empirical 
research. In a sense, in the West we are only beginning to appreciate 
Bernstein's legacy, an appreciation not only forced on us by emerging data on 
multidegree-of-freedom activities, but also as we get a better feel for the 
deep problems of biological motion. For example, recent work on human posture 
has demonstrated that rapid and flexible reactions occur in remote muscles 
when those activities are necessary to preserve function (e.g., stable posture 
when holding a cup of tea). Claims, however, that such effects "constitute a 
distinct and apparently new, class of motor reaction" (Marsden, Merton, & 
Morton, 1983, p. 645, emphasis ours) are myopic in light of this and previous 
Russian work, and may simply reflect a Western bias (see e.g., Gelfand, 
Gurfinkel, Tsetlin, & Shik, 1971). The old aphorism that one who is ignorant 
of history is destined to relive it, applies also, it seems, to insular 
attitudes in science. 

*The author was invited by the Editor of The Behavioral and Brain Sciences to 
publish this article as a Continuing Commentary on the Berkenblit, Fel'dman, 
and Fukson (in press) article, but declined. The present paper was 
considered too long for inclusion in the original treatment, but was 
nevertheless forwarded to the authors, who have considered it in their 
Response to Commentators (Fel'dman, personal communication). 

It is interesting to note that Marsden et al.'s research on posture led 
them, by their own admission, to abandon an earlier influential servo-theory 
of stretch reflex function. Yet Berkenblit et al. hold tightly to the 
concept of reflex as the basis of volitional action, even though the frog's 
wiping behavior is far from the "machine-like fatality" envisaged by 
Sherrington or the "machine-like, inevitable reactions" of Pavlov (see 
Fearing, 1930/1970). In the first part of this comment, we advocate a requiem 
for the reflex, Sherrington's (1906) "likely, if not probable fiction" and 
"purely abstract conception." We take, along with much other evidence, the 
adaptive, context-sensitive and functionally-specific motor behavior of the 
spinal frog, beautifully shown in the experiments of Berkenblit et al., as 
contributing to the reflex's death-knell. In a manner consistent with 
Bernstein, we claim that the functional units of action are not anything like 
reflexes: reflexes may be elemental, but they are not fundamental in the 
sense of affording an understanding of coherent action (for a discussion of 
the elemental-fundamental distinction in modern particle physics, see Buckley 
& Peat, 1979). The reflex is a vestige of Descartes and Newton, of a 
machine-view of animal action, and in our view it is time to discard it. The 
same could be said of explanations whose ontology rests on the formal machine 
concept, that is, the motor program and its neurally-based counterpart, the 
central pattern generator (CPG). But that issue has been addressed elsewhere 
as Berkenblit et al. note (Footnote 1, see also Kelso, 1981, in press; 
Selverston, 1980, and commentaries). 

The second part of this commentary addresses two central issues lucidly 
demonstrated and discussed by Berkenblit et al.: (1) the capability of a 
tremendously complex system possessing a huge number of degrees of freedom to 
"simulate" a simple, knowable system like a mass-spring; and (2) relatedly, 
the system's capability to achieve the same microscopic product (e.g., wiping) 
with a variety of different effectors, in the face of perturbations and 
changes in initial conditions, and through a (potentially infinite) number of 
trajectories. As Berkenblit et al. note, this "constancy" has parallels in 
perception and even in morphogenesis. The reproducibility of functional 
behavior in spite of much variability in the "reflex" itself and in the 
components that contribute to it is indicative of what the biologist would 
call structural stability, that is, a pronounced invariance in form and 
function against spatial or temporal deformations (e.g., Thom, 1975; Thompson, 
1917; Weiss, 1969). These facts of action and perception (not principles, 
mark you, but phenomena to be understood) can be brought under a common 
dynamical framework, although here we can only hint at its general features. 
To some extent, this involves linking the work of Berkenblit et al. with that 
of their colleagues who study regular and stochastic motion in simple and 
multidegree of freedom dissipative systems (e.g., Andronov & Chaikin, 1949; 
Arnol'd, 1978), a field in which there is currently tremendous interest (e.g., 
Feigenbaum, 1980; Grassberger & Procaccia, 1983; Haken, 1975, 1983). 

Requiem for the Reflex? 

The idea that voluntary movement is constructed from reflexes (innate 
patterns) and ultimately effected by reflex parameterization is not new 
(cf. Fearing, 1930/1970), although the particular mechanism envisaged by 
Berkenblit et al. may be. Much confusion has arisen in physiology and 
psychology over the usage, meaning, and assumptions underlying the concept of 
reflex. It would not be too hard to document, in the fifty-five years 
following Fearing's brilliant historical and critical treatise on the reflex, 
the same conceptual pitfalls that he detailed. 

A chief source of confusion rests on the assumption that simplicity of 
anatomy dictates simplicity of function, and vice-versa. The reflex arc is 
something with which we are all familiar, and the step toward understanding 
stimulus-response behavior must have seemed a natural one. In fairness, 
Sherrington (1906) was wary of such interpretative ease; near the end of his 
Silliman lectures, he stressed the importance of understanding the interaction 
of reflexes and volitional control. In our view, Sherrington was in a bind. 
On the one hand, he could map neurophysiologically the "reflex machinery," the 
wiring diagram, in certain simple cases (e.g., spinal preparations) and thus 
relate anatomy to function (e.g., the scratch reflex, the stepping reflex). 
On the other hand, as a self-professed Darwinian, he recognized that "the 
difficulty in assigning purpose to a particular reflex is hazardous and 
inversely proportional to the field covered by the reflex effect" 
(Sherrington, 1906, p. 239). In our view, the difficulty lay in Sherrington's 
belief that reflexes, by definition, were hard-wired entities. 

Bernstein (1928/1967) took a very different tack from Sherrington. For 
Bernstein, movement was hypothesized to be "a living morphological object," 
not "chains of details but structures which are differentiated into details" 
(p. 67). The identity of movement with emerging form meant that changes in 
one single detail of a movement could lead to "a whole series of others which 
are sometimes very far removed from the former both in space and time" 
(p. 69). For Bernstein, a perturbation to the "motor field" was felt by the 
field as a whole in such a way as to preserve integrity of system function. 
It was the form or topology of action that was preserved. This is the essence 
of the coordinative structure construct (Berkenblit et al., in press, Section 
3), by definition, an ensemble of neuromuscular components temporarily 
assembled as a functionally-specific unit. The remarkable adaptability and 
variability in the spinal frog's wiping behavior is characteristic of a 
coordinative structure, not a reflex, at least by any conventional definition. 

A coordinative structure organization, as seen in the spinal frog, is 
apparent in many different activities, attesting further to Berkenblit et 
al.'s intuition that nature operates with ancient themes. But in our view, it 
is not so much that higher levels exploit innate patterns as it is that 
coordinative structures are evident at every level of motor system description 
and across phyletic strata. This is because functions, not reflexes, are 
evolutionary primitives. For example, in the case of speech, a so-called 
"higher level" activity, an unexpected perturbation to the jaw during upward 
motion for final /b/ closure in the utterance /baeb/ reveals near-immediate 
changes in upper and lower lip muscles and movements (15-30 ms), but no 
changes in tongue muscle activity. The same perturbation applied during the 
utterance /baez/ evokes rapid and increased tongue muscle activity for /z/ 
frication, but no active lip movement (Kelso, Tuller, & Fowler, 1982; Kelso, 
V.-Bateson, & Fowler, 1984). Note that the form of interarticulator 
coordination is neither random nor hard-wired, but unique and specific to the 
phoneme produced. That a challenge to one member of a group of potentially 
independent articulators is met, on the very first perturbation experience, by 
remotely (but not, note, mechanically) linked members of the group, provides 
strong support for coordinative structures as the meaningful units of 
behavioral action, regardless of anatomical "level." Though such adaptive 
behavior could, because of its speed, be described as reflexive, its 
mutability speaks against any kind of reflex organization. 

To recognize the coordinative structure as Greene's (1971) "significant 
informational unit" is not merely a plea for a change in terminology. It is 
to underscore the "soft," flexible nature of a unit of action, and to take us 
away from the hard-wired language of reflexes and CPGs or the hard-algorithmed 
language of computers (formal machines), which are the source of the motor 
program/CPG idea. In place of such machine metaphors, the coordinative 
structure construct emphasizes the analytic tools of qualitative (nonlinear) 
dynamics (e.g., Abraham & Shaw, 1982; Arnol'd, 1978) and the physical 
principles of cooperative phenomena in nonequilibrium, open systems (e.g., 
Haken, 1975). It is this "equipment" that may, on the one hand, provide a 
principled account of the phenomena discussed by Berkenblit et al. (in press) 
and, on the other, bring the study of biological motion into the mainstream of 
theoretical science. 

Constancies, Motor Equivalence and Attractors 

Berkenblit et al. (in press) ask: How do different movement 
trajectories, with different effectors and in the face of changing contextual 
conditions, manage to accomplish the same goal? Similarly, for the case of 
perceptual constancies, one can inquire: How do different retinal images 
yield the same percept? Note that in each case the number of microscopic 
degrees of freedom is enormous (e.g., the neurons, neuronal connections, 
muscle fibers involved in lifting a finger or the light rays to the eye, the 
retinal mosaic, and neural processing structures involved in perceiving an 
object). Yet somehow, this high dimensionality gets "compressed" into a lower 
dimensional subspace. How such compression is realized is the challenge 
faced, not only by students of action and perception, but in other realms of 
science as well. For example, chemistry asks how low-dimensional behavior, 
such as periodicity, arises in the Belousov-Zhabotinskii reaction even when 
thirty or more chemical species are present in the reaction vessel (e.g., 
Shaw, 1981). In the case of movement control and perception, a key may lie in 
the identity between the flow of a dynamical system (as reflecting, say, the 
self-equilibrating characteristic of a complex, multidegree of freedom motor 
system) and the flowing optic array described originally by Gibson (1950). In 
the former case, the flow is represented in the qualitative shapes or forms of 
motion observed in the system's phase portrait, that is, the totality of all 
possible phase plane trajectories generated by a particular dynamical system 
under a given parameterization. In the latter case, the visual flow is 
equivalent to optical structure (defined in terms of optical motion vectors 
rather than Euclidean images, see e.g., Johansson, 1977) that is lawfully 
generated by the environmental layout of surfaces and by the movements of 
animals (see e.g., Gibson, 1950, Chapter 7). In each case the relevant 
parameters are found to be macroscopic and low-dimensional. 

Tasks like wiping off a noxious stimulus or reaching for a cup yield 
patterned forms of motion characteristic of point attractor dynamics, a 
generic category that denotes the fact that all trajectories on the phase 
portrait flow to an asymptotic equilibrium state (a basin of attraction). It 
is important to realize that multidegree-of-freedom systems whose trajectories 
converge to a stable position can also be described in the low-dimensional 
language of point attractor dynamics. In the context of movement, this is 
because the system is dissipative, that is, there is a contraction (not a 
conservation as in Hamiltonian systems) of phase space volume onto a surface 
of lower dimensionality than the original space. Other kinds of attractors 
corresponding to stable, steady-state motions in N-dimensional systems are 
periodic attractors or limit cycles, which are capable of characterizing 
rhythmical tasks like chewing, locomotion and perhaps speaking (e.g., Kelso, 
V.-Bateson, Saltzman, & Kay, 1985). Moreover, given the presence of chaos 
even in simple deterministic dynamical systems (e.g., Feigenbaum, 1980), 
chaotic attractors in movement are not unlikely. Here we see the beginning of 
a way to conceptualize and model how an extremely complex system becomes 
controllable as low-dimensional dynamics. 
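
The point attractor idea can be illustrated with the simplest dissipative 
system, a damped mass-spring. The sketch below is only an illustration under 
assumed values: the function name, the stiffness, damping, and mass settings, 
and the integration step are all hypothetical and are not taken from 
Berkenblit et al. or from any fitted data. It shows trajectories started from 
different initial conditions converging on the same equilibrium state. 

```python
def simulate_mass_spring(x0, v0, stiffness=4.0, damping=1.5, mass=1.0,
                         x_eq=0.0, dt=0.001, steps=20000):
    """Integrate mass*x'' = -stiffness*(x - x_eq) - damping*x'.

    Uses semi-implicit Euler steps; all parameter values are illustrative
    assumptions, chosen only so that the system is clearly dissipative.
    """
    x, v = x0, v0
    for _ in range(steps):
        accel = (-stiffness * (x - x_eq) - damping * v) / mass
        v += accel * dt   # update velocity first ...
        x += v * dt       # ... then position, using the new velocity
    return x, v


# Three different initial conditions (position, velocity) in the phase plane.
starts = [(-2.0, 0.0), (0.5, 3.0), (3.0, -1.0)]
finals = [simulate_mass_spring(x0, v0) for x0, v0 in starts]

for (x0, v0), (x, v) in zip(starts, finals):
    print(f"start=({x0:+.1f}, {v0:+.1f}) -> end=({x:+.5f}, {v:+.5f})")
```

With positive damping every start ends near (x_eq, 0), which is the sense in 
which the equilibrium is an attractor; setting damping to zero would give the 
conservative case, in which phase space volume does not contract. 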

This is obviously only a small part of a big story. Berkenblit et al. 
(in press) refer in several places to critical behavior (e.g., the critical 
hip phase angle for initiating the locomotory swing phase) and bifurcations 
(e.g., in terms of switching among trajectory subcomponents, abrupt 
modifications of movement pattern). As they note, although such phenomena are 
well known (if under-recognized) in movement, their lawful basis is not 
understood. Certain theoretical programs that deal explicitly with pattern 
formation and change (e.g., Haken, 1975; Nicolis & Prigogine, 1977) suggest a 
basis for understanding these and other phenomena. In synergetics, for 
example, near regions of instability (i.e., before qualitative shifts in 
pattern occur) the system's behavior can be completely specified by one or a 
few order parameters (Haken, 1975, 1983). Such order parameters are created 
by the cooperation among the individual components of a complex system, and 
they in turn govern the behavior of these components. They therefore afford, 
in principle, a linkage between macro- and microlevels of description. Using 
concepts of synergetics and nonlinear oscillator theory, Haken, Kelso, and 
Bunz (1985) have offered an explicit theoretical model of phase transitions in 
bimanual activity (Kelso, 1984) that should have general applicability to the 
kinds of critical phenomena and bifurcations discussed by Berkenblit et al. 
(in press). The theoretical strategy employed by Haken et al. may be worth 
noting. First, they specify the layout of attractor states characterizing the 
stable bimanual modes and show how, under the influence of continuous scaling 
on a control parameter, the layout changes, at a critical value, from one 
attractor to another. Then they derive this scenario and other features of 
the data (Kelso, 1984) from the equations of motion of each hand and a 
nonlinear coupling between them. Recently, Kelso and Scholz (1985) have 
verified several novel predictions of an extended version of the model 
(Schöner, Haken, & Kelso, 1986), including the existence of critical slowing 
down in order parameter behavior as the transition is approached, and enhanced 
fluctuations in order parameter behavior near the bifurcation region. Such 
predictions and results would hardly be expected from conventional motor 
program/CPG accounts of "switching" behavior, for example, gait changes 
(cf. Grillner, 1982, p. 224; Schmidt, 1982, p. 316). 
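
The change in attractor layout under a scaled control parameter can be 
sketched numerically. The text above does not reproduce the model's 
equations, so the sketch assumes the potential usually quoted for the Haken, 
Kelso, and Bunz (1985) model, V(phi) = -a cos(phi) - b cos(2 phi), where phi 
is the relative phase between the hands and the ratio b/a plays the role of 
the control parameter. The function names and the sample values of a and b 
are hypothetical. 

```python
import math


def hkb_potential(phi, a, b):
    """Assumed HKB potential over relative phase phi (radians)."""
    return -a * math.cos(phi) - b * math.cos(2.0 * phi)


def hkb_curvature(phi, a, b):
    """Second derivative of the potential; positive means a local minimum."""
    return a * math.cos(phi) + 4.0 * b * math.cos(2.0 * phi)


def stable_modes(a, b):
    """Return which of the two fixed points (phi = 0, phi = pi) are stable."""
    modes = []
    if hkb_curvature(0.0, a, b) > 0.0:
        modes.append("in-phase")
    if hkb_curvature(math.pi, a, b) > 0.0:
        modes.append("anti-phase")
    return modes


# Scale the control parameter b/a downward (a is held at 1.0).
for b in (1.0, 0.5, 0.3, 0.2, 0.1):
    print(f"b/a = {b:.2f}: stable modes = {stable_modes(1.0, b)}")
```

Under this assumed potential the anti-phase minimum at phi = pi exists only 
while b/a exceeds 1/4, because the curvature there is -a + 4b. Below that 
value only the in-phase attractor remains, which is one way to picture the 
layout changing "from one attractor to another" at a critical value. 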

The present framework may apply not just to biological motion, per se, 
but to the perception-action system as a total unit. Elaboration of Gibson's 
work on visual flow fields, for example (see e.g., Lee, 1980; Lee & Reddish, 
1981), reveals how the rate of dilation, τ(t), of a bounded region of optical 
structure specifies the time at which a moving object will contact a surface. 
(Note: the ratio of retinal expansion velocity and retinal size is equivalent 
to the inverse of τ.) Flies, for example, have been demonstrated to begin to 
decelerate prior to surface contact at a critical value of the inverse of τ 
(Wagner, 1982). Thus, not only do the τ parameter and its rate of change 
provide continuous information for modulating ongoing activity, but at 
certain critical points, the system exhibits bifurcations to adaptive modes of 
behavior as well. In this view, then, information for the perception-action 
system is specified in the morphology of the flow field (Gibson, 1950, Ch. 7). 
The flow field geometry is defined in terms of flow vectors to and from a 
focus of expansion, which can be conceived as basically an attractor or 
repellor. As the facts of motor equivalence/equifinality tell us, attractors 
and their layout must be defined in terms of their significance or 
meaningfulness for the perception-action system (i.e., in Berkenblit et al., 
in press, the "reflex" is variable, but the goal achievement is not). Thus, a 
further consequence of the present framework, to which we can only allude 
here, is a dynamic information theory (see e.g., Haken, 1984), one in which 
information is not viewed in the classical Shannonian sense, as a measure for 
scarcity of a message or ignorance regarding systemic states (i.e., as 
receiver-independent), but rather as carrying its own semantic content for the 
receiver. 
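
The note on τ in the paragraph above can be made concrete with a small 
calculation. The sketch assumes an object of fixed size approaching head-on 
at constant speed and uses the small-angle approximation for its optical 
size; the function names, the numerical values, and the critical threshold 
are hypothetical and are not taken from Lee (1980) or Wagner (1982). 

```python
def optical_angle(size, distance):
    """Approximate optical size (radians) of an object, small-angle form."""
    return size / distance


def optical_angle_rate(size, distance, speed):
    """Rate of optical expansion for a constant closing speed."""
    return size * speed / distance ** 2


def tau(theta, theta_dot):
    """Optical size divided by its rate of expansion."""
    return theta / theta_dot


def should_decelerate(theta, theta_dot, critical_inverse_tau):
    """Hypothetical rule: act once the inverse of tau reaches a threshold."""
    return theta_dot / theta >= critical_inverse_tau


size, speed = 0.1, 2.0            # metres, metres per second (assumed)
for distance in (10.0, 5.0, 2.0):
    theta = optical_angle(size, distance)
    theta_dot = optical_angle_rate(size, distance, speed)
    print(f"distance={distance:4.1f}  tau={tau(theta, theta_dot):.3f}  "
          f"true time to contact={distance / speed:.3f}  "
          f"decelerate={should_decelerate(theta, theta_dot, 0.5)}")
```

Under these assumptions τ equals distance divided by speed without either 
quantity being measured separately, so the optical variable alone specifies 
time to contact, and a threshold on its inverse gives a simple picture of a 
critical point at which behavior switches. 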

In conclusion, there is reason to suppose that an understanding of the 
multidegree of freedom activities of animals and people falls squarely on the 
shoulders of an emerging theory of cooperativity and pattern formation in 
open, complex systems. If so, we conclude where we began, namely that many of 
the phenomena beautifully treated by Berkenblit et al. (in press) may not be 
"special" to movement science, and thus may not require "special" concepts 
beyond those developed from first principles. The view that we are pursuing 
is that biological motion is an important test field for the essential 
elaboration of these basically physical (but, mark, non-Newtonian and 
nonmechanical) ideas. 